SHELLdorado Newsletter 1/2005

SHELLdorado Newsletter 1/2005 - April 30th, 2005

================================================================
The "SHELLdorado Newsletter" covers UNIX shell script related
topics. To subscribe to this newsletter, leave your e-mail
address at the SHELLdorado home page:

        http://www.shelldorado.com/

View previous issues at the following location:

        http://www.shelldorado.com/newsletter/

"Heiner's SHELLdorado" is a place for UNIX shell script
programmers providing

     Many shell script examples, shell scripting tips & tricks,
     a large collection of shell-related links & more...
================================================================

Contents

 o  Shell Tip: How to read a file line-by-line
 o  Shell Tip: Print a line from a file given its line number
 o  Shell Tip: How to convert upper-case file names to lower-case
 o  Shell Tip: Speeding up scripts using "xargs"
 o  Shell Tip: How to avoid "Argument list too long" errors

-----------------------------------------------------------------
>> Shell Tip: How to read a file line-by-line
-----------------------------------------------------------------

    Assume you have a large text file, and want to process it
    line by line. How could you do it?

	file=/etc/motd
    	for line in `cat $file`	# WRONG
	do
	    echo "$line"
	done

    ...is no solution, because the variable "line" will in turn
    contain each (whitespace-delimited) *word* of the file, not
    each line. The "while" command is a better candidate for
    this job:

	file=/etc/motd
    	while read line
	do
	    echo "$line"
	done < "$file"

    Note that the "read" command automatically processes its
    input: it removes leading and trailing whitespace from each
    line, and joins a line ending with "\" with the one
    following. The following commands suppress this behaviour:

    	file=/etc/motd
	OIFS=$IFS; IFS=		# Change input field separator
	while read -r line
	do
	    echo "$line"
	done < "$file"
	IFS=$OIFS		# Restore old value
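
    A variant worth knowing (a sketch, not part of the original
    tip): writing "IFS=" directly in front of "read" limits the
    change to that one command, so IFS needs no saving and
    restoring, and commands inside the loop still see the normal
    IFS. The demo file name and its contents are made up for
    illustration.

```shell
# Sample input (made up for this sketch): one indented line and
# one line ending in a backslash.
tmpfile=${TMPDIR:-/tmp}/readdemo.$$
printf '%s\n' '   indented' 'ends with \' 'last' > "$tmpfile"

# Plain "read": strips the indentation and joins the "\" line
# with the line that follows it.
plain=$(while read line; do printf '[%s]\n' "$line"; done < "$tmpfile")

# "IFS= read -r": the assignment applies to this one "read" call
# only, so nothing has to be saved or restored.
raw=$(while IFS= read -r line; do printf '[%s]\n' "$line"; done < "$tmpfile")

rm -f "$tmpfile"
printf '%s\n' "$plain" "---" "$raw"
```

    The first loop prints two lines ("[indented]" and
    "[ends with last]"), the second one prints all three input
    lines unchanged inside the brackets.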

    There is still one disadvantage to this loop: it is slow. If
    the processing consists of string manipulations, consider
    replacing the loop completely, e.g. with an AWK script.
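
    To illustrate the idea, here is a minimal sketch that
    prefixes each line with its line number, once as a shell
    loop and once as an AWK one-liner. The function names
    "number_lines_sh" and "number_lines_awk" are hypothetical,
    chosen only for this example; no timings are claimed here.

```shell
# Shell loop: one "read" call and one "printf" call per line.
number_lines_sh() {
    n=0
    while IFS= read -r line
    do
        n=$((n + 1))
        printf '%d: %s\n' "$n" "$line"
    done < "$1"
}

# AWK: a single process reads and prints the whole file.
# NR is the current line number, $0 the unmodified line.
number_lines_awk() {
    awk '{ printf "%d: %s\n", NR, $0 }' "$1"
}
```

    Usage: "number_lines_awk /etc/motd". Both functions expect
    the file name as their only argument.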
    
    Portability:
     	"read -r" is available with ksh, ksh93, bash, zsh, and
	POSIX shells, but not with older Bourne shells (sh).


-----------------------------------------------------------------
>> Shell Tip: Print a line from a file given its line number
-----------------------------------------------------------------

    Regular expressions can be very powerful, and there are many
    tools (like "egrep") that let you apply them to any file. But
    what if we simply want to get the 5th line of a file? No
    elaborate regular expression required:

	lineno=5
    	sed -n "${lineno}p"

    prints the 5th line without involving ^.[]*$ or other
    meta-characters resembling the noise of a defective serial
    interface. "sed -n" means: do not automatically print each
    line. "5p" indicates: print line 5. We have to use
    "${lineno}p" here instead of "$linenop", because otherwise
    the shell would try to expand a variable named "linenop",
    not knowing that "p" is a "sed" command.

    This could be improved upon. Assume the input file is
    "/usr/dict/words", and consists of 25143 lines. The "sed"
    command above would not only dutifully print line 5, but
    also continue to read the following 25138 lines, doing what
    it was told to do: ignore them. The following command makes
    "sed" stop reading after line 5:

    	lineno=5
	sed -n "${lineno}{p;q;}"
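
    For repeated use, the command can be wrapped in a small
    function. This is only a sketch: the name "print_line" and
    the argument check are additions for illustration, not part
    of the tip itself.

```shell
# print_line is a hypothetical helper name.
# usage: print_line LINENO [FILE...]   (reads standard input if no FILE)
print_line() {
    lineno=$1
    shift
    # Accept positive decimal numbers only; "sed" has no line 0.
    case $lineno in
        ''|0|*[!0-9]*)
            echo "print_line: invalid line number: $lineno" >&2
            return 1 ;;
    esac
    # Print the requested line, then quit without reading the rest.
    sed -n "${lineno}{p;q;}" "$@"
}
```

    Usage: "print_line 5 /usr/dict/words", or at the end of a
    pipeline: "ls | print_line 5".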

    So you think you have a better solution for this problem?
    Prove it: send me your suggestion
    (heiner.steven@shelldorado.com, closing date: 2005-05-31),
    and I'll measure the speed of all contributions on a Linux
    and a Solaris system. The fastest (or most elegant) solution
    using only POSIX shell commands will be published in the
    next SHELLdorado Newsletter.


-----------------------------------------------------------------
>> Shell Tip: How to convert upper-case file names to lower-case
-----------------------------------------------------------------

    Admit it: you sometimes copy files from an operating system
    with a name ending in *indows. A frequent annoyance is file
    names IN ALL UPPER CASE.

    The following command renames them to contain only lower
    case characters:

    	for file in *
	do
	    lcase=`echo "$file" | tr '[A-Z]' '[a-z]'`

	    # Does the target file exist already? Do not
	    # overwrite it:
	    [ -f "$lcase" ] && continue

	    # Are old and new name different?
	    [ x"$file" = x"$lcase" ] && continue # no change

	    mv "$file" "$lcase"
	done

    The KornShell (and ksh93) has the useful "typeset -l"
    option, which will automatically convert the contents of a
    variable to lower case:

    	$ typeset -l lcase=ABCDE
	$ echo "$lcase"
	abcde

    Changing the above loop to use "typeset -l" is left as an
    exercise for the reader.


-----------------------------------------------------------------
>> Shell Tip: Speeding up scripts using "xargs"
-----------------------------------------------------------------

    An essential part of writing fast scripts is avoiding
    unnecessary external processes.

    	for file in *.txt
	do
	    gzip "$file"
	done

    is much slower than just

    	gzip *.txt

    because the former code may need many "gzip" processes for a
    task the latter command accomplishes with only one external
    process.  But how could we build a command line like the one
    above when the input files come from a file, or even
    standard input? A naive approach could be

    	gzip `cat textfiles.list archivefiles.list`

    but this command can easily run into an "Argument list too
    long" error, and doesn't work with file names containing
    embedded whitespace characters. A better solution for the
    first problem is using "xargs":

    	cat textfiles.list archivefiles.list | xargs gzip

    The "xargs" command reads its input, splits it into words at
    blanks and newlines, and builds a command line by appending
    these words to its arguments (here: "gzip"). File names
    containing whitespace therefore still need quoting or
    escaping in the input. With one plain file name per line,
    the input

    	a.txt
	b.txt
	c.txt

    would result in "xargs" executing the command
    
    	gzip a.txt b.txt c.txt
    	
    "xargs" also takes care that the resulting command line does
    not get too long, and therefore avoids "Argument list too
    long" errors.
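
    The splitting into several command lines can be made visible
    with a small experiment. In this sketch the option "-n 2"
    forces an artificially low limit of two arguments per
    command, and "echo gzip" prints each command line instead of
    compressing anything. The file names and the variable
    "batches" are made up for the demonstration.

```shell
# Five made-up names on standard input; "-n 2" caps each command
# line at two arguments, so "xargs" has to start three commands.
batches=$(printf '%s\n' a.txt b.txt c.txt d.txt e.txt |
    xargs -n 2 echo gzip)
printf '%s\n' "$batches"
```

    This prints "gzip a.txt b.txt", "gzip c.txt d.txt" and
    "gzip e.txt" on three lines. Without "-n", "xargs" applies
    the system's own size limit in the same way.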


-----------------------------------------------------------------
>> Shell Tip: How to avoid "Argument list too long" errors
-----------------------------------------------------------------

    Oh no, there it is again: the system's spool directory is
    almost full (4018 files); old files need to be removed, and
    all useful commands only print the dreaded "Argument list
    too long":

    	$ cd /var/spool/data
	$ ls *
	ls: Argument list too long
	$ rm *
	rm: Argument list too long

    So what exactly in the character '*' is too long? Well, the
    current shell does the useful work of converting '*' to a
    (large) list of files matching that pattern. This is not the
    problem. Afterwards, it tries to execute the command (e.g.
    "/bin/ls") with the file list using the system call
    execve(2) (or a similar one). This system call has a
    limitation for the maximum number of bytes that can be used
    for arguments and environment variables(*), and fails.
    
    It's important to note that the limitation is on the side of
    the system call, not the shell's internal lists.
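
    The actual limit of a system can be queried with the POSIX
    "getconf" command. The following is a small sketch; the
    variable names are arbitrary, and the environment size is
    only a rough estimate.

```shell
# Ask the system for the limit (in bytes) on arguments plus
# environment variables passed to a new program.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# The environment counts against the same limit; "env | wc -c"
# gives an approximate size.
env_bytes=$(env | wc -c | tr -d ' ')
echo "environment currently uses about $env_bytes bytes"
```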

    To work around this problem, we'll use shell-internal
    functions, or ways to limit the number of files directly
    specified as arguments to a command.

    Examples:

     o  Don't specify arguments, to get the (hopefully) useful
        default:

    	$ ls

     o  Use shell-internal functionality ("echo" and "for" are
        shell-internal commands):

	$ echo *
	file1 file2 [...]

	$ for file in *; do rm "$file"; done	# be careful!

     o  Use "xargs"

	$ ls | xargs rm		# careful!

	$ find . -type f -size +100000 -print | xargs ...

     o  Limit the number of arguments for a command:

     	$ ls [a-l]*
	$ ls [m-z]*

    Using these techniques should help you get around the
    problem.
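
    The "find ... | xargs" approach can be tried safely in a
    scratch directory. This is a sketch under assumptions: the
    directory name, the ".tmp" suffix and the number of files
    are invented, and 300 files are far too few to trigger the
    error; the point is only that the file names never appear on
    a shell command line.

```shell
# Scratch directory so that nothing real gets deleted.
dir=${TMPDIR:-/tmp}/spooldemo.$$
mkdir "$dir" || exit 1

# Create 300 made-up spool files plus one file to keep.
i=0
while [ "$i" -lt 300 ]
do
    : > "$dir/job$i.tmp"
    i=$((i + 1))
done
: > "$dir/keep.me"

# No "*" is expanded by the shell here: "find" writes the names
# to a pipe, and "xargs" hands them to "rm" in batches.
(cd "$dir" && find . -type f -name '*.tmp' -print | xargs rm -f)

remaining=$(ls "$dir" | wc -l | tr -d ' ')
echo "files left: $remaining"
rm -rf "$dir"
```

    Only "keep.me" survives, so the script reports one remaining
    file. As noted for "xargs" in general, this simple form
    assumes file names without embedded whitespace.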

    ---
    (*) Parameter ARG_MAX, often 128K (Linux) or 1 or 2 MB
        (Solaris).


----------------------------------------------------------------
If you want to comment on this newsletter, have suggestions for
new topics to be covered in one of the next issues, or even want
to submit an article of your own, send an e-mail to

        mailto:heiner.steven@shelldorado.com

================================================================
To unsubscribe, send a mail with the body "unsubscribe" to
newsletter@shelldorado.com
================================================================