
Text Manipulation

An impressive share of all the work done on a computer is done with text-processing programs. Even the work of programmers consists less of analysis, design, testing, redesign and so on than of creating and modifying text in all its forms. This is especially true now that computers are used for the entire design process, from programming through documentation.

Many of the programs covered in this part (such as grep and sort) have already been mentioned in passing in earlier parts, but they are discussed in more detail here.

Printing A File

Printing a file is a common task. Below are a few examples of printing different kinds of files.

First, let's look at the basic syntax of a print command:

$lpr -Plinuxlj filename
$

The "-P" in front of linuxlj is very important!!! (lpr has a number of important and useful options that you should become familiar with, so I recommend taking a look at lpr's man page.) The "-Pprinter" option selects printer as the output device.

Surprisingly enough, that's really all there is to it! Before we move on to pr, here is a list of ways to print different file formats:
 

The best way to print HTML files (so that they look as they do in a browser, not like HTML code) is to use the print option in your web browser.

Preparing A File For Printing With `pr'

Though you may hear evidence to the contrary, pr does not print anything itself: it prepares a file for printing, or, if you like, makes a file pretty for printing. (Again, be sure to check out the man page for the list of options.)

Here's an example for the use of pr using a file with a list of names:

$pr people

Jul 8 13:21 1997 people Page 1

Scottie
Michael
Toni
Dennis
Luc
Ron

$

A couple of notes here: 1) after "Ron" there are a lot of blank lines, which are not shown here to save space; 2) note the header, which is attached to every page; 3) the output here goes to the Standard Output. If you want it to go to a printer, you need to pipe it to the lpr command discussed above.

"-t" is one of the more useful options to pr. It suppresses both the header and the trailer (all the blank lines). The "-h" option allows you to specify the heading.

$pr -h Distribution people|lpr -Plinuxlj
$

(Note the pipe to the printer.) Now the printout, instead of having the file name people in the page header, has the word "Distribution". If your title has spaces, remember to use double quotes. Also, if you want to retain the page numbers, but without any title in the header, use an empty string (i.e. "") as the argument to "-h".

As you can see, the list of names in people is not very long. But suppose it contained hundreds of names and hence went on for pages. If you were to print the list, apart from wasting paper, it would be awkward to handle. It would be better if there were only one page with several columns of names on it. To achieve this, we tell pr how many columns we want by preceding the number of columns with the - character, which usually introduces options:

$pr -4 -h Distribution lotsofpeople|lpr -Plinuxlj
$
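
To preview a layout before sending it to the printer, you can leave off the pipe to lpr and let pr write to the Standard Output. The sketch below is only an illustration: the file name people.txt and the scratch directory are made up for the example, mktemp is assumed to be available, and "-t" is added so that the header and trailing blank lines do not clutter the screen.

```shell
# Build a small sample file in a scratch directory (names are hypothetical).
workdir=$(mktemp -d)
printf '%s\n' Scottie Michael Toni Dennis Luc Ron > "$workdir/people.txt"

# Three columns, no header or trailer; the result goes to the terminal.
preview=$(pr -t -3 "$workdir/people.txt")
printf '%s\n' "$preview"

rm -rf "$workdir"
```

Once the preview looks right, the same pr command can be piped to lpr as shown above.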

Splitting A File With `split'

As powerful as most commands are, sometimes a file is simply too large for them. At such times a "divide and conquer" approach is used, and the file is split into smaller, more manageable chunks. The split utility performs this task. Once a file has been split into smaller pieces, the pieces can be edited one at a time and then concatenated into one whole file again with the cat command. (Note that the original version of the file remains intact.)

When a file is split, it is separated into 1000-line pieces by default. (The number of lines per piece can be changed with the "-number" option.) The output files are given the names `xaa', `xab', ..., `xzz'. (You can replace the x with a name of your own by adding that name to the end of the split command line.)

One very useful application of split is to print individual pages of a file. Suppose you had obtained a printout of a large file, say 100 pages or so, then discovered some minor errors in the sixth and tenth pages. After correcting the file, you don't necessarily want to print the whole thing again, just the pages with the errors. Here's an example of how to accomplish this (take note of the syntax for split since this is our first example):

$split bigfile page
$lpr -Plinuxlj pageaf pageaj
$

The sixth page is the file `pageaf' and the tenth page is the file `pageaj'. For this to work, each piece must hold exactly one printed page, so you will need to reduce the piece size from the default 1000 lines with the "-number" option. I'm not sure exactly which size is best (the only example I've seen used "-66"), so you may want to experiment a little.

And of course, to get the smaller (presumably edited) files back together use:

$cat page?? > bigfile
$

You may want to use a new file instead of bigfile in case an error occurs and you lose your original file. Also, remember the significance of the "??" from the Wild-card section. And don't forget to remove the files that resulted from split, when you are done with them.
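
The whole round trip can be tried safely on a throwaway file. The following is a minimal sketch, not a recipe: the 10-line stand-in for bigfile, the name rejoined and the scratch directory are all invented for the example. It also uses "-l 4", the POSIX spelling of the line-count option, on the assumption that your split may not accept the older "-number" form.

```shell
# Work in a scratch directory so the split pieces are easy to clean up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A stand-in for bigfile: 10 numbered lines (hypothetical data).
i=1
while [ "$i" -le 10 ]; do
    echo "line $i"
    i=$((i + 1))
done > bigfile

# Split into 4-line pieces named pageaa, pageab and pageac.
split -l 4 bigfile page
pieces=$(ls page?? | wc -l)

# Rejoin into a NEW file, then confirm that it matches the original.
cat page?? > rejoined
if cmp -s bigfile rejoined; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "$pieces pieces, round trip: $roundtrip"

# Remove the pieces once they are no longer needed.
rm -f page??
```

Writing to rejoined rather than back to bigfile means the original is still there if something goes wrong.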

A Short Section On `sort'

sort is a command that we have mentioned briefly before. In short, it "sorts stuff". It is also a command that has a lot of options. I'll mention a few useful(?) ones here, but be sure to check out the man page for the rest.

Let's use our list of names (i.e. the people file) that has come up now and then in examples. I've done a little preparation and added last names and numbers to the list, so each line in people has the format "firstname lastname number". sort works on "fields" separated by tabs or spaces in your file (this is convenient, since you don't have to put your file in a columnar layout). If we just use sort, the default is to compare whole lines, starting with the first field (i.e. people will be sorted by first name). You can change which field to sort on with the "+number" and/or "-number" options. (So to sort by last name you would use sort +1 people.) One little problem arises if two people have the same last name: their order is then decided by whatever follows the last name (the number), not by the first name. Easily solved: sort +1 -2 people. That is, skip one field before you start sorting (first names), then stop sorting after the second field (last names). When sort finds two lines whose last names are equal, it falls back to comparing the lines from the beginning, so the tie is broken by first name. Here's our list of sorted names (have you figured out yet where the names are from?):

$sort +1 -2 people
Ron Harper 9
Michael Jordan 23
Toni Kukoc 7
Luc Longley 13
Scottie Pippen 33
Dennis Rodman 91
$

Just a couple of notes on some options before you head to the next section. If there happen to be two occurrences of the same line (probably a typo, or the result of merging files), you will probably not want the name to appear twice. Use the "-u" (for unique) option. When sorting numeric fields, use the "n" option (usually attached to a field number, as in "+2n" to sort the list above by number). This tells sort to compare the field as a number rather than as text, and to skip any blanks that precede the number. (The blanks are not very relevant here, but if the list were laid out in columns there would be many of them.) Use the "r" option to sort in reverse order. Finally, if you have a group of sorted files (make sure they are already sorted!) that you want to merge together, you can use the "-m" option. Use redirection to save a sorted list or the result of a merge. For example:

$sort -m +1 -2 people morepeople > everybody
$
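
The "+1 -2" notation is the historical way of naming sort fields, and newer versions of sort may reject it. The "-k" option is the POSIX way of saying the same thing: fields are counted from 1, so "+1 -2" becomes "-k 2,2". The sketch below uses a made-up copy of the people file in a scratch directory, and sets LC_ALL=C only to keep the ordering predictable.

```shell
workdir=$(mktemp -d)
cat > "$workdir/people" <<'EOF'
Scottie Pippen 33
Michael Jordan 23
Toni Kukoc 7
Dennis Rodman 91
Luc Longley 13
Ron Harper 9
EOF

# Sort on the second field (last name); the same idea as "sort +1 -2".
by_name=$(LC_ALL=C sort -k 2,2 "$workdir/people")
printf '%s\n' "$by_name"

# Sort numerically on the third field, largest number first.
by_number=$(LC_ALL=C sort -k 3,3nr "$workdir/people")
printf '%s\n' "$by_number"

rm -rf "$workdir"
```

The letters after the field number ("n" for numeric, "r" for reverse) work the same way as the "n" and "r" options described above.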

Counting Things With `wc'

The wc command counts the number of lines, words, and characters in a file. This is pretty straightforward, so let's pick through an example:

$wc people

       7        18        95 people
$

wc tells us that the people file has 7 lines, 18 words and 95 characters (including newlines). You can ask for any of the three counts (in any combination) by using the "-l" (line), "-w" (word), or "-c" (character) options. You can also count multiple files, in which case wc gives a count for each file specified and also a cumulative total for each category.

One neat trick with wc is to check and see how many people are on the system.

$who | wc

    39       195      1794
$

The first number is the line count, and who prints one line per logged-in user, so there are 39 users on the system.
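
When only one of the counts is wanted, the options keep the output down to a single number. Here is a small sketch; the three-line sample file and the scratch directory are invented for the example.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Ron Harper 9' 'Michael Jordan 23' 'Toni Kukoc 7' > "$workdir/sample"

# Reading from the Standard Input keeps the file name out of the output.
lines=$(wc -l < "$workdir/sample")
words=$(wc -w < "$workdir/sample")
echo "lines: $lines, words: $words"

# The same idea applied to who: one line per login session.
sessions=$(who | wc -l)
echo "login sessions: $sessions"

rm -rf "$workdir"
```

Note that a user who is logged in more than once is counted once per session.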

Finding Text Patterns In A File With `grep'

grep is another program we have mentioned. It has a wide range of uses and options, and I will try to do it justice here. However, I'd recommend viewing the man page (as usual) and perhaps a more in-depth user's guide.

grep is a utility program which searches a file, or more than one file, for lines which contain strings of a certain pattern. Such lines are said to match the pattern. Lines which match the specified pattern are printed to Standard Output.

In its simplest use, grep just looks for a pattern which consists of a fixed character string. (It is possible, however, to describe more complex patterns, called "regular expressions". More on this shortly.) For example:

$grep Pippen people
Scottie Pippen 33
$

What we'd expect. If you search multiple files, each line of output is prefixed by "filename:". If you want to search for the person's whole name, make sure to put it in quotes, because the pattern you're looking for must form a single argument to grep. Otherwise, if you used the command grep Scottie Pippen people, you would get an error message from grep as it tried unsuccessfully to open the non-existent file Pippen.

A few options are worth knowing. If, for some reason, you wish to print everything but the lines that match, use the "-v" (for invert) option. Use the "-n" option to show the line numbers of any matches (very useful). If you give the "-c" option, the matching lines are not printed; instead, a count of the lines that match the pattern in each file is shown. "-l" prints only the names of the files that contain matching lines. Finally (for now), you can use the "-y" option (spelled "-i" in current versions of grep) to tell grep to ignore the difference between upper and lower case. For example, grep -y pippen people will find "Pippen" in the file.
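
The options are easiest to get a feel for by running them side by side. The following sketch rebuilds a copy of the people file in a scratch directory (the directory and the variable names are made up for the example) and uses "-i", on the assumption that your grep prefers it to "-y".

```shell
workdir=$(mktemp -d)
cat > "$workdir/people" <<'EOF'
Ron Harper 9
Michael Jordan 23
Toni Kukoc 7
Luc Longley 13
Scottie Pippen 33
Dennis Rodman 91
EOF

# -n: show the line number of each match.
numbered=$(grep -n 'Scottie Pippen' "$workdir/people")
echo "$numbered"

# -c: count the lines that match instead of printing them.
count=$(grep -c 'an' "$workdir/people")
echo "$count"

# -v: print every line EXCEPT the ones that match.
others=$(grep -v 'Pippen' "$workdir/people")
printf '%s\n' "$others"

# -i: ignore case, so "pippen" still finds "Pippen".
anycase=$(grep -i 'pippen' "$workdir/people")
echo "$anycase"

rm -rf "$workdir"
```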

Regular Expressions In Text Patterns

So far we have only given grep a fixed character string to look for, but it is capable of more complex searches. We can give grep a pattern (or template) of the text we want to search for.

Such a pattern or template is called a "regular expression" and the name of the command derives from that. grep stands for "global regular expression print".

Regular expressions work in a way similar to the Shell's file-matching capability. Certain characters have a special meaning. These special characters are called "metacharacters" because they represent something other than themselves. Because many of the characters which have special meanings in regular expressions also have special meaning to the UNIX Shell, it is best to enclose the regular expression in quotes. Single quotes ( ' ) are safest, but often double quotes ( " ) are sufficient.

I'm going to skim quickly over some examples and recommend that you experiment on your own. A lot. In your experimenting, combine several metacharacters in a single regular expression for some nice results. Two of the simplest metacharacters to use are the circumflex ^ and the dollar sign $, which match the beginning of a line and the end of a line, respectively.

The period (or "dot" (sound familiar?)) is a metacharacter which matches any character at all. Characters enclosed in brackets [ ] specify a set of characters that are to be searched for. The match is on any one of the characters inside the brackets. A number enclosed in braces { } following an expression specifies the number of times that the preceding expression is to be repeated. So a search for four-letter words beginning with "D" or "d" could be expressed:

" [Dd][a-z]{3} "

This repeat number specification is known as a "closure".

The general format of the closure is {n,m}, where n is the minimum number of repeats and m is the maximum number of repeats. A missing n is assumed to be one, and a missing m is assumed to be infinity (or at least huge).

There are shorthand ways of expressing some closures:
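
The most common shorthands are "*" (zero or more repeats, the same as {0,}), "+" (one or more, {1,}) and "?" (zero or one, {0,1}). The sketch below assumes a grep that accepts "-E" for extended regular expressions, where braces, "+" and "?" can be written without backslashes; with plain grep the braces usually have to be written as \{ and \}. The test words are invented for the example.

```shell
# Three test words: no "b", one "b", two "b"s.
words=$(printf '%s\n' ac abc abbc)

star=$(printf '%s\n' "$words" | grep -E -c '^ab*c$')      # b repeated 0 or more times
plus=$(printf '%s\n' "$words" | grep -E -c '^ab+c$')      # b repeated 1 or more times
maybe=$(printf '%s\n' "$words" | grep -E -c '^ab?c$')     # b repeated 0 or 1 times
exact=$(printf '%s\n' "$words" | grep -E -c '^ab{2}c$')   # b repeated exactly twice
echo "star=$star plus=$plus maybe=$maybe exact=$exact"

# The four-letter-word pattern from the text, written for grep -E.
dword=$(echo 'we dare you' | grep -E ' [Dd][a-z]{3} ')
echo "$dword"
```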

Sick of grep yet? I thought so. Just one last thing of interest that I picked up: a way to search through trees of files for a particular regular expression. A (too?) powerful search engine that you can use on your own files (or system files):

I call this script 'forall'. Use it like this:

$forall /usr/include grep -i expression
$forall /usr/man grep expression
Here's forall:
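     #!/bin/sh
     # forall: run a command on every regular file below a directory.
     if [ $# -lt 2 ]
     then
             echo "Usage: $0 dir cmd [optargs]" >&2
             exit 1
     fi
     dir=$1
     shift
     # Note: file names containing spaces or newlines will confuse xargs.
     find "$dir" -type f -print | xargs "$@"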
Just copy this into a file, make sure the file is executable, and away you go.

Translating Character Strings With `tr'

The utility program tr translates (or transliterates) characters in a file. tr works on the Standard Input. If you want to take input from a file, you have to redirect the Standard Input so that it comes from that file.

tr can take two arguments which specify character sets. Each member of the first set is replaced by the equivalent member of the second set. To give a crazy example:

$ tr a-z zyxwvutsrqponmlkjihgfedcba < people
Rlm Hzikvi 9
Mrxszvo Jliwzm 23
Tlmr Kfplx 7
Lfx Llmtovb 13
Sxlggrv Prkkvm 33
Dvmmrh Rlwnzm 91
$

We have reversed the alphabet for the lowercase letters. Uppercase letters are not affected because we didn't include them in our character set.

And that's really about it. (There are a few other options, the most important of which is "-d" (for delete). If, for example, we wanted to get rid of the numbers in our people file, we could use the following command: tr -d 0-9 < people. This writes to Standard Output; to save the changes, use redirection.)
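
Two everyday uses of tr are sketched below: converting lowercase to uppercase, and deleting a set of characters. The sample line is taken from the people file; the variable names are made up for the example.

```shell
# Quoting the sets is a harmless habit; it starts to matter once a set
# contains characters the shell treats specially, such as * or [ ].
upper=$(echo 'Scottie Pippen 33' | tr 'a-z' 'A-Z')
echo "$upper"

# -d deletes every character in the set; here, all digits.
nodigits=$(echo 'Scottie Pippen 33' | tr -d '0-9')
echo "$nodigits"
```

Deleting the digits leaves the space that preceded them, so the second result ends in a blank.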

Comparing And Contrasting With `diff', `cmp' And `comm'

A fairly common occurrence is that there are several different versions of a file around at various stages of development. When this arises, it is important to be able, at any time, to get an answer to the question: "How does the latest version of this file differ from the previous version of the file?" You can use several different utilities to compare files.

First, we will look at diff. diff can display the differences between two text files. diff finds all differences, so if you change the spacing on a line, or remove spaces from the end of a line, these will show up as differences (unless you use the "-b" option). The name diff comes from "differential file comparator".

For an example let's use the ever trusty people file. I've created a new file called people.new that has changed the first line to "Jason Caffey 35"; changed Jordan's number to 45; and completely removed Rodman. Use diff to find the differences. (The order of filename arguments on the diff command line is important. We'll see why in a minute.)

$diff people.new people
1,2c1,2
< Jason Caffey 35
< Michael Jordan 45
---
> Ron Harper 9
> Michael Jordan 23
5a6
> Dennis Rodman 91
$

It seems to have found the changes, but if we didn't know what they were in advance, how would we decode diff's output? Look at the first line of output. It says that lines 1 and 2 of people.new differ from lines 1 and 2 of people (the "c" stands for "changed"). (You'll get the hang of it after a while.) It then lists the two lines from each file. The "<" indicates that the line is from the first file on the command line (people.new) and ">" indicates that it is from the second. The last change, "5a6", says that a line has to be added after line five of people.new to make it match people; the new line is line six of the people file. (Actually, the line was deleted from the people file to make people.new. If we were to reverse the order of the arguments to diff, the first difference would be reported the same way, with the "<" and ">" lines swapped, but the last would be reported as:

6d5
< Dennis Rodman 91

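i.e. line 6 of the first file has been deleted.)

Here is a short summary of the meaning of diff's results. There are only three ways in which diff indicates a change to a file: "a" (lines have to be added), "d" (lines have to be deleted) and "c" (lines have to be changed). In each case the line numbers before the letter refer to the first file and the line numbers after it refer to the second file.

Another utility program which can be used to find differences between two files is cmp, for compare. While diff looks for lines that are different, cmp just does a byte-by-byte (character-by-character for text files) comparison of the two files you specify: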

$ cmp people.new people
people.new people differ: char 1, line 1
$

As soon as cmp finds one byte that is different between the two files, it prints out a message as shown in the example, and stops. If you want to see all the differences you need to use the "-l" (for long) option. However, this can get pretty long even for short files, as cmp prints out every byte that is different in the two files. Hence, cmp is not very suitable for showing the differences between text files; it is more suited to program object and data files. However, it does provide a quick way to find out whether files are different or not. If the files are different, you can then use diff to get the details.
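
The "quick check first, details second" approach can be sketched in a few lines. This is only an illustration: the two three-line files and the scratch directory are invented, and it relies on the "-s" (silent) option of cmp, which prints nothing and reports the result through its exit status.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Ron Harper 9' 'Michael Jordan 23' 'Toni Kukoc 7' > "$workdir/people"
printf '%s\n' 'Ron Harper 9' 'Michael Jordan 45' 'Toni Kukoc 7' > "$workdir/people.new"

# cmp -s prints nothing; its exit status says whether the files are identical.
if cmp -s "$workdir/people.new" "$workdir/people"; then
    report='files are identical'
else
    # diff exits with status 1 when it finds differences; that is not an
    # error, so "|| true" keeps a strict shell from stopping here.
    report=$(diff "$workdir/people.new" "$workdir/people" || true)
fi
printf '%s\n' "$report"

rm -rf "$workdir"
```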

The diff and cmp commands answer the question "what is different about these two files?". We now discuss a command that answers the question "what is the same about these two files?".

The comm command prints lines that are common to two files. (comm is designed for sorted files, so sort your files first if you want results you can rely on.)

$comm people people.new

        Jason Caffey 35
        Michael Jordan 45
Ron Harper 9
Michael Jordan 23
                Toni Kukoc 7
                Luc Longley 13
                Scottie Pippen 33
        
Dennis Rodman 91
$
The comm utility produces three columns. Unfortunately they overlap, so they aren't too easy to read.

The first column shows lines that are in the first of the files you specified, but not in the second. The second column shows lines that are in the second file, but not in the first. The third column shows lines that appear in both files.

If you don't want to see all this information, you can suppress the first, second or third column by using the "-1", "-2" or "-3" option respectively. Note that if you suppress the first column, the second and third columns shift to the left, and so on.
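
Combining the column options gives a readable answer to "what is the same?" or "what is only in the old file?". The sketch below is an illustration with made-up file names; it sorts both files first, since comm compares them line by line in sort order, and sets LC_ALL=C so that sort and comm agree on that order.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'Luc Longley 13' 'Ron Harper 9' 'Toni Kukoc 7' > "$workdir/old"
printf '%s\n' 'Jason Caffey 35' 'Luc Longley 13' 'Toni Kukoc 7' > "$workdir/new"

# Sort both files before handing them to comm.
LC_ALL=C sort "$workdir/old" > "$workdir/old.sorted"
LC_ALL=C sort "$workdir/new" > "$workdir/new.sorted"

# -12 suppresses columns 1 and 2, leaving only the lines found in both files.
common=$(LC_ALL=C comm -12 "$workdir/old.sorted" "$workdir/new.sorted")
printf '%s\n' "$common"

# -23 leaves only the lines that are unique to the first file.
only_old=$(LC_ALL=C comm -23 "$workdir/old.sorted" "$workdir/new.sorted")
printf '%s\n' "$only_old"

rm -rf "$workdir"
```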

Things That Make You Say `awk'

Are you ready to be confused?


Uh...no?
 

Well, I apologize in advance. We'll be discussing the awk command in this section. Maybe it's not too bad, but it can be intimidating to beginners, especially non-programmers. So without further ado: da da da dot, da daaa...awk

awk is another text selection and alteration tool in the same family as grep. In addition to providing a means to search for text patterns, awk extends the capabilities to selecting specific fields from lines and testing relationships between those fields. awk can be thought of as a "programmable report-generator".

At its simplest, what awk does is to select a line (from a file) according to some selection criteria. The selection criteria can be text patterns (regular expressions) as in grep and other utilities. Having found lines of interest, awk can then perform some actions on the line, or portions of the line.

This selection-action process is represented in awk notation by:

pattern {action}
and means, for every record (line) which matches the specified pattern, perform the specified action.

Both the pattern and the action are optional. A pattern with no corresponding action simply selects the matched record for display on the Standard Output. An action with no associated pattern is performed on all records in the file. In other words, a missing pattern matches all lines in the file.

An awk pattern is specified by enclosing it in slashes:

/pattern/
The pattern can be regular expressions as described in grep and others.

Let's start with an easy(?) example. We're going to take our people file, and print (to Standard Output) the names with last name first.

$awk '{print $2 ", " $1 " " $3}' people
Harper, Ron 9
Jordan, Michael 23
Kukoc, Toni 7
Longley, Luc 13
Pippen, Scottie 33
Rodman, Dennis 91
$

This is a very simple awk program, with an action but no pattern. The action part (enclosed in the braces) specifies that awk is to print the fields in the order indicated. awk considers every record in a file to be composed of fields. Fields are separated by field separators, which are normally spaces or tabs, but can be changed to whatever you like. Fields are accessed by using the $n notation. Also, notice that we had to insert the spaces and the comma ourselves, by enclosing them in quoted strings. If you just listed the fields, all the output would be scrunched together. (If you use such a conversion often, you may want to put the action in a file and then use the "-f" switch, i.e. $awk -f swap people, where swap contains: {print $2 ", " $1 " " $3})

awk has two special "built-in" patterns, called BEGIN and END. If BEGIN appears as a pattern, it matches the beginning of a file, so that you can gain control before any other processing is done. Similarly, the pattern END matches the end of the file so you can gain control when you get to the end of the file.
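
Changing the field separator is done with the "-F" option. As a small sketch, here is a colon-separated record in the style of /etc/passwd; the record itself is made up for the example.

```shell
# A colon-separated record, in the style of /etc/passwd (made-up data).
record='pippen:x:1033:100:Scottie Pippen:/home/pippen:/bin/sh'

# -F sets the field separator; $1 is the login name, $5 the full name.
summary=$(echo "$record" | awk -F: '{print $5 " (" $1 ")"}')
echo "$summary"
```

Because the separator is now a colon, the space inside "Scottie Pippen" no longer splits the name into two fields.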

The next example shows how END is used to calculate the total batting average for our softball team. First of all, we create an awk program in a file called average:

$cat > average
{total = total + $3}
END {print "Team batting average is", total / NR}
^D
$

Now, the batting averages are tabulated in the file softball, in a format similar to the people file (i.e. firstname lastname average). We compute the team average with awk, like this:

$awk -f average softball
Team batting average is 0.3625
$

Not too shabby.

In this simple awk program we have introduced quite a few new features. The first line of the program adds the value of the third field of each record (line) of the data file to the variable called "total". When a variable (such as "total" in the example) is first mentioned, awk creates it and sets its initial value to zero.

The second line in our awk program indicates what has to be done when the end of the data file is reached (specified by the pattern "END"). First we print a message "Team batting average is", then on the same line we print the value of "total" divided by "NR". The variable "NR" is built in to awk. Its value is the Number of Records (lines) read so far, so in the END action it equals the total number of lines in the file.

An awk pattern can in fact be a conditional expression, not just a simple character string. For example, if we only want the people in people with a number of 20 or greater, use:
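
Because NR counts every line, a stray blank line in the data would drag the average down, and an empty file would lead to a division by zero. One way to guard against both is to count the players yourself. The following is a sketch under that assumption; the four players and their averages are invented, and "players" is simply a variable name chosen for the example.

```shell
# Made-up batting averages in "firstname lastname average" format.
result=$(printf '%s\n' 'Ann Lee 0.300' 'Bob Ray 0.400' 'Cy Fox 0.350' 'Di Orr 0.400' |
    awk '
        NF >= 3 { total = total + $3; players = players + 1 }   # skip blank or short lines
        END {
            if (players > 0)
                print "Team batting average is", total / players
            else
                print "No players found"
        }')
echo "$result"
```

NF is another built-in variable: the Number of Fields in the current record.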

$awk '$3 >=20 {print $0}' people
Michael Jordan 23
Scottie Pippen 33
Dennis Rodman 91
$

($0 is special: it refers to the whole record.) The symbol >= means "greater than or equal to". Try also: <= (less than or equal to); < and > (strictly less than and strictly greater than, respectively); == (equal to); != (not equal to).

You can also specify a pattern range. If you want only the people whose numbers fall between 23 and 33, try the commands below. (You will need the numbers to be sorted to get the correct results. Sort of. The range starts at the first line that matches the first pattern and ends at the first line that matches the second pattern, so if the final value occurs more than once, awk only outputs the first occurrence.)

$awk '{print $3 " "$1 " "$2}' people > temp
$sort -n temp >temp2
$awk '/23/,/33/ {print $0}' temp2
23 Michael Jordan
33 Scottie Pippen
$
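
The range pattern above matches text, not numbers, which is why the sorting step is needed. If all you want is "every number from 23 to 33", a conditional expression that combines two comparisons with && (and) avoids the temporary files altogether. A minimal sketch, with the people data supplied inline for the example:

```shell
# The numbers need not be sorted: each line is tested on its own.
inrange=$(printf '%s\n' 'Ron Harper 9' 'Michael Jordan 23' 'Toni Kukoc 7' \
                        'Luc Longley 13' 'Scottie Pippen 33' 'Dennis Rodman 91' |
    awk '$3 >= 23 && $3 <= 33')
printf '%s\n' "$inrange"
```

There is no action here, so awk simply prints each record that satisfies the condition.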

As you can see, awk is a powerful and wide-ranging utility. If you still have some questions about awk, you may want to consult The GNU Awk User's Guide.