ref: 2b99422480d596ebc26921c87c6bb81a07949f3e
dir: /ch9.ms.old/
.so tmacs .BC 9 "More tools .BS 2 "Regular expressions .LP We have used .CW sed to replace one string with another. But, what happens here? .P1 ; echo foo.xcc | sed 's/.cc/.c/g' foo..c ; echo focca.x | sed 's/.cc/.c/g' f.ca.x .P2 .LP We need to learn more. .PP In addresses of the form .CW /text/ and in commands like .CW s/text/other/ , the string .CW text is not a string for .CW sed . This happens to many other programs that search for things. For example, we have used .CW grep to print only lines containing a string. Well, the .I string given to grep, like in .P1 ; grep string file1 file2 ... .P2 .LP is .I not a string. It is a .B "regular expression" . A regular expression is a little language. It is very useful to master it, because many commands employ regular expressions to let you do complex things in an easy way. .PP The text in a regular expression represents many different strings. You have already seen something similar. The .CW *.c in the shell, used for globbing, is very similar to a regular expression. Although it has a slightly different meaning. But you know that in the shell, .CW *.c .B matches with many different strings. In this case, those that are file names in the current directory that happen to terminate with the characters “\f(CW.c\fP”. That is what regular expressions, or .I regexps , are for. They are used to select or match text, expressing the kind of text to be selected in a simple way. They are a language on their own. A regular expression, as known by .CW sed , .CW grep , and many others, is best defined recursively, as follows. .IP • Any single character .I matches the string consisting of that character. For example, .CW a matches .CW a , but not .CW b . .IP • A single dot, “\f(CW.\fP”, matches .I any single character. For example, “\f(CW.\fP” matches .CW a and .CW b , but not .CW ab . .IP • A set of characters, specified by writing a string within brackets, like .CW [abc123] , matches .I any character in the string. This example would match .CW a , .CW b , or .CW 3 , but not .CW x . A set of characters, but starting with .CW ^ , matches any character .I not in the set. For example, .CW [^abc123] matches .CW x , but not .CW 1 , which is in the string that follows the .CW ^ . A range may be used, like in .CW [a-z0-9] , which matches any single character that is a letter or a digit. .IP • A single .CW ^ , matches the start of the text. And a single .CW $ , matches the end of the text. Depending on the program using the regexp, the text may be a line or a file. For example, when using .CW grep , .CW a matches the character .CW a at .I any place. However, .CW ^a matches .CW a only when it is the first character in a line, and .CW ^a$ also requires it to be the last character in the line. .IP • Two regular expressions concatenated match any text matching the first regexp followed by any text matching the second. This is more hard to say than it is to understand. The expression .CW abc matches .CW abc because .CW a matches .CW a , .CW b matches .CW b , and so on. The expression .CW [a-z]x matches any two characters where the first one matches .CW [a-z] , and the second one is an .CW x . .IP • Adding a .CW * after a regular expression, matches zero or any number of strings that match the expression. For example, .CW x* matches the empty string, and also .CW x , .CW xx , .CW xxx , etc. Beware, .CW ab* matches .CW a , .CW ab , .CW abb , etc. But it does .I not match .CW abab . The .CW * applies to the preceding regexp, with is just .CW b in this case. .IP • Adding a .CW + after a regular expression, matches one or more strings that match the previous regexp. It is like .CW * , but there has to be at least one match. For example, .CW x+ does not match the empty string, but it matches every other thing matched by .CW x* . .IP • Adding a .CW ? after a regular expression, matches either the empty string or one string matching the expression. For example, .CW x? matches .CW x and the empty string. This is used to make parts optional. .IP • Different expressions may be surounded by parenthesis, to alter grouping. For example, .CW (ab)+ matches .CW ab , .CW abab , etc. .IP • Two expressions separated by .CW | match anything matched either by the first, or the second regexp. For example, .CW ab|xy matches .CW ab , and .CW xy . .IP • A backslash removes the special meaning for any character used for syntax. This is called a .I escape character. For example, .CW ( is not a well-formed regular expression, but .CW \e( is, and matches the string .CW ( . To use a backslash as a plain character, and not as a escape, use the backslash to escape itself, like in .CW \e\e . .PP That was a long list, but it is easy to learn regular expressions just by using them. First, let's fix the ones we used in the last section. This is what happen to us. .P1 ; echo foo.xcc | sed 's/.cc/.c/g' foo..c ; echo focca.x | sed 's/.cc/.c/g' f.ca.x .P2 .LP But we wanted to replace .CW .cc , and not .I any character and a .CW cc . Now we know that the first argument to the .CW sed command .CW s , is a regular expression. We can try to fix our problem. .P1 ; echo foo.xcc | sed 's/\e.cc/.c/g' foo.xcc ; echo focca.x | sed 's/\e.cc/.c/g' focca.x .P2 .LP It seems to work. The backslash removes the special meaning for the dot, and makes it match just one dot. But this may still happen. .P1 ; echo foo.cc.x | sed 's/\e.cc/.c/g' foo.c.x .P2 .LP And we wanted to replace only the extension for file names ending in .CW .cc . We can modify our expression to match .CW .cc only when immediately before the end of the line (which is the string being matched here). .P1 ; echo foo.cc.x | sed 's/\e.cc$/.c/g' foo.cc.x ; echo foo.x.cc | sed 's/\e.cc/.c/g' foo.x.c .P2 .LP Sometimes, it is useful to be able to refer to text that matched part of a regular expression. Suppose you want to replace the variable name .CW text with .CW word in a program. You might try with .CW s/text/word/g , but it would change other identifiers, which is not what you want. .P1 ; cat f.c void printtext(char* text) { print("[%s]", text); } ; sed 's/text/word/g' f.c void printword(char* word) { print("[%s]", word); } .P2 .LP The change is only to be done if .CW word is not surounded by characters that may be part of an identifier in the program. For simplicity, we will assume that these characters are just .CW [a-z0-9_] . We can do what follows. .P1 ; sed 's/([^a-z0-9_])text([^a-z0-9_])/\e1word\e2/g' f.c void printtext(char* word) { print("[%s]", word); } .P2 .LP The regular expression .CW [^a-z0-9_]text[^a-z0-9_] means “any character that may not be part of an identifier”, then .CW text , and then “any character that may not be part of an identifier”. Because the substitution affects .I all the regular expression, we need to substitute the matched string with another one that has .CW word instead of .CW text , but keeping the characters matching .CW [^a-z0-9_] before and after the string .CW text . This can be done by surounding in parentheses both .CW [^a-z0-9_] . Later, in the destination string, we may use .CW \e1 to refer to the text matching the first regexp within parenthesis, and .CW \e2 to refer to the second. .PP Because .CW printtext is not matched by .CW [^a-z0-9_]text[^a-z0-9_] , it was untouched. However, “\f(CW␣text)\fP” was matched. In the destination string, .CW \e1 was a white space, because that is what matched the first parenthesized part. And .CW \e2 was a right parenthesis, because that is what matched the second one. As a result, we left those characters untouched, and used them as .I context to determine when to do the substitution. .PP Regular expressions permit to clean up source files in an easy way. In may cases, it makes no sense to keep white space at the end of lines. This removes them. .P1 ; sed 's/[ \t]*//' .P2 .LP We saw that a script .CW t+ can be used to indent text in Acme. Here it is. .P1 ; cat /bin/t+ #!/bin/rc sed 's/^/\t/' ; .P2 .LP This other script removes one level of indentation. .P1 ; cat /bin/t- #!/bin/rc sed 's/^\t//' ; .P2 .LP There are many other useful uses of regular expressions, as you will be able to see from here to the end of this book. In many cases, your C programs can be made more flexible by accepting regular expressions for certain parameters instead of mere strings. For example, an editor might accept a regular expression that determines if the text is to be shown using a .CW "constant width font" or a .I "proportional width font" . For file names matching, say .CW .*\e.[ch] , it could use a constant width font, to show C source code like you might be acustomed to see it. .PP It turns out that it is .I trivial to use regular expressions in a C program, by using the .CW regexp library. The expression is .I compiled into a description more amenable to the machine, and the resulting data structure (called a .CW Reprog ) can be used for matching strings against the expression. This program accepts a regular expression as a parameter, and then reads one line at a time. For each such line, it reports if the string read matches the regular expression or not. .so src/match.c.ms .LP The call to .CW regcomp .I compiles the regular expression into .CW prog . Later, .CW regexec .I executes the compiled regular expression to determine if it matches the string just read in .CW buf . The parameter .CW sub points to an array of structures that keeps information about the match. The whole string matching starts at the character pointed to by .CW sub[0].sp and terminates right before the one pointed to by .CW sub[0].ep . Other entries in the array report which substring matched the first parenthesized expression in the regexp, .CW sub[1] , which one matched the second one, .CW sub[2] , etc. They are similar to .CW \e1 , .CW \e2 , etc. This is an example session with the program. .P1 ; 8.out '*.c' \fRThe * needs something on the left!\fP regerror: missing operand for * ; 8.match '\e.[123]' !!x123 no match !!.123 matched: '.1' !!x.z no match !!x.3 matched: '.3' .P2 .BS 2 "Sorting and searching .LP One of the most useful task achieved with a few shell commands is inspecting the system to find out things. In what follows we are going to learn how to do this, using several assorted examples. .PP Running out of disk space? It is not likely, given the big disks we have today. But anyway, which ones are the biggest files you have created at your home directory? .PP The command .CW du reports disk usage, measured in disk blocks. A disk block is usually 8 or 16 Kbytes, depending on your file system. Although .CW "du -a" reports the size in blocks for each file, it is a burden to scan by yourself through the whole list of files to search for the biggest one. The command .CW sort is used to sort lines of text, according to some criteria. We can ask .CW sort to sort the output of .CW du in numerically decreasing order (biggest numbers first) and then use .CW sed to print just the first few lines. Those ones correspond to the biggest files, which we are interested in. .P1 ; du -a bin | sort -nr | sed 15q 4211 bin 3085 bin/arm 864 bin/arm/enc 834 bin/386 333 bin/arm/madplay 320 bin/arm/madmix 319 bin/arm/deco 316 bin/386/minimad 316 bin/arm/minimad 280 bin/arm/mp3 266 bin/386/minisync 258 bin/rc 212 bin/arm/calc 181 bin/arm/mpg123 146 bin/386/r2bib ; .P2 .LP This includes directories as well, but point us quickly to files like .CW bin/arm/enc that seem to occupy 864 disk blocks! .PP But in any case, if the disk is filling up, it is a good idea to locate the users that created files (or added data to them), to alert them. The flag .CW -m for .CW ls lists the user name that last modified the file. We may collect user names for all the files in the disk, and then notify them. We are going to play with commands until we complete our task, using .CW sed to print just a few lines until we know how to process all the information. The first step is to use the output of .CW du as the initial data, the list of files. If we remove everything up to the file names, we obtain a list of files to work with. .P1 ; du -a bin | sed 's/.* //' | sed 3q bin/386/minimad bin/386/minisync bin/386/r2bib .P2 .LP Now we want to list the user who modified each file. We can change our data to produce the commands that do that, and send them to a shell. .P1 ; du -a bin | sed 's/.* //' | sed 's/^/ls -m /' | sed 3q ls -m bin/386/minimad ls -m bin/386/minisync ls -m bin/386/r2bib ; ; du -a bin | sed 's/.* //' | sed 's/^/ls -m /' | sed 3q | rc [nemo] bin/386/minimad [none] bin/386/minisync [nemo] bin/386/r2bib ; .P2 .LP We still have to work a little bit more. And our command line is growing. Being able to edit the text at any place in a Rio window does help, but it can be convenient to define a .B "shell function" that encapsulates what we have done so far. A shell function is like a function in any other language. The difference is that a shell function receives arguments as any other command, in the command line. Besides, a shell function has command lines in its body, which is not a surprise. Defining a function for what we have done so far can save some typing in the near future. Furthermore, the command we have just built, to list all the files within a given directory, is useful by itself. .P1 ; fn lr { ;; du -a $1 | sed 's/.* //' | sed 's/^/ls -m /' | rc ;; } ; .P2 .LP This defined a function, named .CW lr , that executes exactly the command line we developed. Rc stores the function definition using an environment variable. Thus, most things said for environment variables apply for functions as well (e.g., think about .CW "rfork e" ). .P1 ; cat /env/'fn#lr' fn lr {du -a $1|sed 's/.* //'|sed 's/^/ls -m /'|rc} ; .P2 .LP In the function, we removed the .CW "sed 3q" because it is not reasonable for a function listing all files recursively to stop after listing three of them. If we want to play, we can always add a final .CW sed in a pipeline. Arguments given to the function are accessed like they would be in a shell script. The difference is that the function is executed by the shell where we invoque it, and not by a child shell. By the way, it is preferable to create useful commands by creating ia shell, functions can not be edited as scripts, and are not automatically shared among all shells like files are. Functions are handy to make modular scripts. Let's try our new function. .P1 ; lr bin [nemo] bin/386/minimad [none] bin/386/minisync [nemo] bin/386/r2bib [nemo] bin/386/rc2bin .I "...and many other lines of output..." ; .P2 .LP To obtain our list of users, we may remove everything but the user name. .P1 ; lr bin | sed 's/.([a-z0-9]+).*/\e1/' | sed 3q nemo none nemo ; .P2 .LP And now, to get a list of users, we must drop dupplicates. The program .CW uniq knows how to do it, it reads lines and prints them, lines showing up more than once in the input are printed once. This program needs an input with sorted lines. Therefore, we do what we just did, and sort the lines and remove dupplicate ones. .P1 ; lr bin | sed 's/.([a-z0-9]+).*/\e1/' | sort | uniq esoriano nemo none ; .P2 .LP Note that we removed .CW "sed 3q" from the pipeline, because this command does what we wanted to do and we want to process the whole file tree, and not just the first three ones. .PP To do something different, we can revisit the first example in the last chapter, finding function definitions. This script does just that, if we follow the style convention for declaring functions that was shown at the beginning of this chapter. First, we try to use .CW grep to print just the source line where the function .CW cat is defined in the file .CW /sys/src/cmd/cat.c . Our first try is this. .P1 ; grep cat /sys/src/cmd/cat.c cat(int f, char *s) argv0 = "cat"; cat(0, "<stdin>"); cat(f, argv[i]); .P2 .LP Which is not too helpful. All the lines contain the string .CW cat , but we want only the lines where .CW cat is at the beggining of line, followed by an open parenthesis. Second attempt. .P1 ; grep '^cat\e(' /sys/src/cmd/cat.c cat(int f, char *s) .P2 .LP At least, this prints just the line of interest to us. However, it is useful to get the file name and line number before the text in the line. That output can be used to point an editor to that particular file and line number. Because .CW grep prints the file name when more than one file is given, we could use .CW /dev/null as a second file where to search for the line. It would not be there, but it would make .CW grep print the file name. .P1 ; grep '^cat\e(' /sys/src/cmd/cat.c /dev/null /sys/src/cmd/cat.c:cat(int f, char *s) .P2 .LP Giving the option .CW -n to .CW grep makes it print the line number. Now we can really search for functions, like we do next. .P1 ; grep -n '^cat\e(' /sys/src/cmd/*.c /sys/src/cmd/cat.c:5: cat(int f, char *s) .P2 .LP And because this seems useful, we can package it as a shell script. It accepts as arguments the names for functions to be located. The command .CW grep is used to search for such functions at all the source files in the current directory. .P1 #!/bin/rc rfork e for (f in $*) grep -n '^'$f'\e(' *.[cCh] .P2 .LP How can we use .CW grep to search for .CW -n ? If we try, .CW grep would get confussed, thinking that we are supplying an option. To avoid this, the .CW -e option tells .CW grep that what follows is a regexp to search for. .P1 ; cat text Hi there How can we grep for -n? Who knows! ; grep -n text ; grep -e -n text how can we grep for -n? .P2 .LP This program has other useful options. For example, if may want to locate lines in the file for a chapter of this book where we mention figures. However, if the word .CW figure is in the middle of a sentence it would be all lowercase. When it is starting a sentece, it would be capitalized. We must search both for .CW Figure and .CW figure. The flag .CW -i makes .CW grep become case-insensitive. All the text read is converted to lowercase before matching the expression. .P1 ; grep -i figure ch1.ms Each window shows a file or the output of commands. Figure figure are understood by acme itself. For commands shown in the figure would be .I "...and other matching lines .P2 .LP A popular searching task is determining if a file containing a mail message is spam or not. Today, it would not work, because spammers employ heavy armoring, and even send their text encoded in multiple images sent as HTML mail. However, it was popular to see if a mail message contained certain expressions, if it did, it was considered spam. Because there will be many expressions, we may keep them in a file. The option .CW -f for .CW grep takes as an argument a file containing all the expressions to search for. .P1 ; cat patterns Make money fast! Earn 10+ millions (Take|use) viagra for a (better|best) life. ; if (grep -i -f patterns $mailfile ) echo $mailfile is spam .P2 .LP A different kind of search is looking for differences. Consider that you have mounted at .CW /n/sources/plan9 the Plan 9 distribution file server, known as .I sources . You Plan 9 is likely to be installed by pulling changes from sources and updating your file tree. Nevertheless, you might want to be sure that a particular part of your file tree is exactly the same as it is in sources. .PP There are several tools that can be used to compare files. We saw .CW cmp , but it compares only two files. Besides, .CW cmp does not give much information, because it is meant to compare files that are binary and not textual, and the program reports just which one is the first byte that makes the files different. However, there is another tool, .CW diff , that is just what we need. For example, we can compare .PP Show diff, to search for differences. See the script bdiff .BS 2 "AWK .PP A simple calculator. Pickup some examples from your bin .BS 2 "More complex things .PP Show the scripts to replace strings. Show the script to automatically add users from a list of people. .PP tal vez seria mejor una seccion en procesar ficheros con campos. y ver alli sort uniq y demas. O a lo mejor no. falta por ver awk, esa seccion es una oportunidad para usar alli mismo sort uniq y demas. tail -f Show eval. Use walk as an example. Show doctype also. must talk about other file systems, plumber, cdfs, tarfs, upasfs, replica the dump .PP