SHELLdorado Newsletter 1/2005 - April 30th, 2005

================================================================
 The "SHELLdorado Newsletter" covers UNIX shell script related
 topics. To subscribe to this newsletter, leave your e-mail
 address at the SHELLdorado home page:

     http://www.shelldorado.com/

 View previous issues at the following location:

     http://www.shelldorado.com/newsletter/

 "Heiner's SHELLdorado" is a place for UNIX shell script
 programmers, providing many shell script examples, shell
 scripting tips & tricks, a large collection of shell-related
 links & more...
================================================================

Contents

 o  Shell Tip: How to read a file line-by-line
 o  Shell Tip: Print a line from a file given its line number
 o  Shell Tip: How to convert upper-case file names to lower-case
 o  Shell Tip: Speeding up scripts using "xargs"
 o  Shell Tip: How to avoid "Argument list too long" errors

-----------------------------------------------------------------
>> Shell Tip: How to read a file line-by-line
-----------------------------------------------------------------

Assume you have a large text file, and want to process it
line-by-line. How could you do it?

    file=/etc/motd
    for line in `cat $file`        # WRONG
    do
        echo "$line"
    done

...is no solution, because the variable "line" will in turn
contain each (whitespace-delimited) *word* of the file, not each
line. The "while" command is a better candidate for this job:

    file=/etc/motd
    while read line
    do
        echo "$line"
    done < "$file"

Note that the "read" command automatically processes its input:
it removes leading whitespace from each line, and concatenates a
line ending with "\" with the line following it. The following
commands suppress this behaviour:

    file=/etc/motd
    OIFS=$IFS; IFS=        # Change input field separator
    while read -r line
    do
        echo "$line"
    done < "$file"
    IFS=$OIFS              # Restore old value

There still is one disadvantage to this loop: it's slow. If the
processing consists of string manipulations, consider replacing
the loop completely, e.g. with an AWK script.

Portability: "read -r" is available in ksh, ksh93, bash, zsh and
POSIX shells, but not in older Bourne shells (sh).

-----------------------------------------------------------------
>> Shell Tip: Print a line from a file given its line number
-----------------------------------------------------------------

Regular expressions can be very powerful, and there are many
tools (like "egrep") that allow you to use them on any file. But
what if we simply want to get the 5th line of a file? No
elaborate regular expression is required:

    file=/usr/dict/words
    lineno=5
    sed -n "${lineno}p" "$file"

prints the 5th line without involving ^.[]*$ or other
meta-characters resembling the noise of a defective serial
interface. "sed -n" means: do not automatically print each line.
"5p" indicates: print line 5. We have to use "${lineno}p" here
instead of "$linenop", because otherwise the shell would try to
expand the variable "$linenop", not knowing that "p" is a "sed"
command.

This can be improved upon. Assume the input file
"/usr/dict/words" consists of 25143 lines. The "sed" command
above would not only dutifully print line 5, but also continue
to read the following 25138 lines, doing what it was told to do:
ignore them. The following command makes "sed" stop reading
after line 5:

    lineno=5
    sed -n "${lineno}{p;q;}" "$file"
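If "sed" is not at hand, or personal taste differs, a short AWK
script does the same job. This is just a sketch; it assumes a
POSIX awk ("nawk" or "gawk" on older systems), because old awk
lacks the "-v" option:

    lineno=5
    # Print line number $lineno, then stop reading the input:
    awk -v n="$lineno" 'NR == n { print; exit }' "$file"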
So you think you have a better solution for this problem? Prove
it: send me your suggestion (heiner.steven@shelldorado.com,
closing date: 2005-05-31), and I'll measure the speed of all
contributions on a Linux and a Solaris system. The fastest (or
most elegant) solution using only POSIX shell commands will be
published in the next SHELLdorado Newsletter.

-----------------------------------------------------------------
>> Shell Tip: How to convert upper-case file names to lower-case
-----------------------------------------------------------------

Admit it: you sometimes copy files from an operating system
whose name ends in *indows. A frequent annoyance is file names
IN ALL UPPER CASE. The following commands rename them to contain
only lower-case characters:

    for file in *
    do
        lcase=`echo "$file" | tr '[A-Z]' '[a-z]'`

        # Does the target file exist already? Do not
        # overwrite it:
        [ -f "$lcase" ] && continue

        # Are old and new name different?
        [ x"$file" = x"$lcase" ] && continue    # no change

        mv "$file" "$lcase"
    done

The KornShell (ksh and ksh93) has the useful "typeset -l"
option, which automatically converts the contents of a variable
to lower case:

    $ typeset -l lcase=ABCDE
    $ echo "$lcase"
    abcde

Changing the above loop to use "typeset -l" is left as an
exercise for the reader.

-----------------------------------------------------------------
>> Shell Tip: Speeding up scripts using "xargs"
-----------------------------------------------------------------

An essential part of writing fast scripts is avoiding external
processes.

    for file in *.txt
    do
        gzip "$file"
    done

is much slower than just

    gzip *.txt

because the former code may need many "gzip" processes for a
task the latter command accomplishes with only one external
process. But how can we build a command line like the one above
when the input files come from a file, or even from standard
input? A naive approach could be

    gzip `cat textfiles.list archivefiles.list`

but this command can easily run into an "Argument list too long"
error, and does not work with file names containing embedded
whitespace characters. A better solution is to use "xargs":

    cat textfiles.list archivefiles.list | xargs gzip

The "xargs" command reads its input line by line, and builds a
command line by appending each line to its argument list (here:
"gzip"). Therefore the input

    a.txt
    b.txt
    c.txt

would result in "xargs" executing the command

    gzip a.txt b.txt c.txt

"xargs" also takes care that the resulting command line does not
get too long, and therefore avoids "Argument list too long"
errors.
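Note that "xargs" itself splits its input at whitespace, so file
names with embedded blanks can still cause trouble. GNU "find"
and GNU "xargs" (but not every implementation) provide the
options "-print0" and "-0" for exactly this case; a sketch,
assuming GNU tools:

    # NUL-separated file names survive embedded whitespace:
    find . -name '*.txt' -print0 | xargs -0 gzip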
-----------------------------------------------------------------
>> Shell Tip: How to avoid "Argument list too long" errors
-----------------------------------------------------------------

Oh no, there it is again: the system's spool directory is almost
full (4018 files); old files need to be removed, and all useful
commands only print the dreaded "Argument list too long":

    $ cd /var/spool/data
    $ ls *
    ls: Argument list too long
    $ rm *
    rm: Argument list too long

So what exactly in the character '*' is too long? Well, the
current shell does the useful work of converting '*' to a
(large) list of files matching that pattern. This is not the
problem. Afterwards, it tries to execute the command (e.g.
"/bin/ls") with the file list, using the system call execve(2)
(or a similar one). This system call has a limit on the maximum
number of bytes that can be used for arguments and environment
variables(*), and fails. It's important to note that the
limitation is on the side of the system call, not the shell's
internal lists.

To work around this problem, we'll use shell-internal
functionality, or ways to limit the number of files directly
specified as arguments to a command. Examples:

 o  Don't specify arguments, to get the (hopefully) useful
    default:

        $ ls

 o  Use shell-internal functionality ("echo" and "for" are
    shell-internal commands):

        $ echo *
        file1 file2 [...]
        $ for file in *; do rm "$file"; done    # be careful!

 o  Use "xargs":

        $ ls | xargs rm        # careful!
        $ find . -type f -size +100000 -print | xargs ...

 o  Limit the number of arguments for a command:

        $ ls [a-l]*
        $ ls [m-z]*

Using these techniques should help to get around the problem.

---
(*) Parameter ARG_MAX, often 128K (Linux) or 1 or 2 MB
    (Solaris). The current value can be printed with
    "getconf ARG_MAX".

----------------------------------------------------------------
If you want to comment on this newsletter, have suggestions for
new topics to be covered in one of the next issues, or even want
to submit an article of your own, send an e-mail to

    mailto:heiner.steven@shelldorado.com

================================================================
To unsubscribe, send a mail with the body "unsubscribe" to
newsletter@shelldorado.com
================================================================