My favorite Unix tool

Awk is a little language designed for the processing of lines of text. It is available on every Unix (since V3) or Linux system. The name is an acronym of the names of its creators: Aho, Weinberger and Kernighan.

Since I spent a couple of minutes to learn awk I have found it quite useful during my daily work. It is my favorite tool in the base set of Unix tools due to its simplicity and versatility.

Typical use cases for awk scripts are log file analysis and the processing of character separated value (CSV) formats. Awk allows you to easily filter, transform and aggregate lines of text.

The idea of awk is very simple. An awk script consists of a number of patterns, each associated with a block of code that gets executed for an input line if the pattern matches:

pattern_1 {
    # code to execute if pattern matches line
}

pattern_2 {
    # code to execute if pattern matches line
}

# ...

pattern_n {
    # code to execute if pattern matches line
}

Patterns and blocks

The patterns are usually regular expressions:

/error|warning/ {
    # executed for each line, which contains
    # the word "error" or "warning"
}

/^Exception/ {
    # executed for each line starting
    # with "Exception"
}

There are some special patterns, namely the empty pattern, which matches every line …

{
    # executed for every line
}

… and the BEGIN and END patterns. Their blocks are executed before and after the processing of the input, respectively:

BEGIN {
    # executed before any input is processed,
    # often used to initialize variables
}

END {
    # executed after all input has been processed,
    # often used to output an aggregation of
    # collected values or a summary
}

Output and variables

The most common operation within a block is the print statement. The following awk script outputs each line containing the string “error”:

/error/ { print }

This is basically the functionality of the Unix grep command, which is filtering. It gets more interesting with variables. Awk provides a couple of useful built-in variables. Here are some of them:

  • $0 represents the entire current line
  • $1$n represent the 1…n-th field of the current line
  • NF holds the number of fields in the current line
  • NR holds the number of the current line (“record”)

By default awk interprets whitespace sequences (spaces and tabs) as field separators. However, this can be changed by setting the FS variable (“field separator”).

The following script outputs the second field for each line:

{ print $2 }

Input:

John 32 male
Jane 45 female
Richard 73 male

Output:

32
45
73

And this script calculates the sum and the average of the second fields:

{
    sum += $2
}

END {
    print "sum: " sum ", average: " sum/NR
}

Output:

sum: 150, average: 50

The language

The language that can be used within a block of code is based on C syntax without types and is very similar to JavaScript. All the familiar control structures like if/else, for, while, do and operators like =, ==, >, &&, ||, ++, +=, … are there.

Semicolons at the end of statements are optional, like in JavaScript. Comments start with a #, not with //.

Variables do not have to be declared before usage (no ‘var’ or type). You can simply assign a value to a variable and it comes into existence.

String concatenation does not have an explicit operator like “+”. Strings and variables are concatenated by placing them next to each other:

"Hello " name ", how are you?"
# This is wrong: "Hello" + name + ", how are you?"

print is a statement, not a function. Parentheses around its parameter list are optional.

Functions

Awk provides a small set of built-in functions. Some of them are:

length(string), substr(string, index, count), index(string, substring), tolower(string), toupper(string), match(string, regexp).

User-defined functions look like JavaScript functions:

function min(number1, number2) {
    if (number1 < number2) {
        return number1
    }
    return number2
}

In fact, JavaScript adopted the function keyword from awk. User-defined functions can be placed outside of pattern blocks.

Command-line invocation

An awk script can be either read from a script file with the -f option:

$ awk -f myscript.awk data.txt

… or it can be supplied in-line within single quotes:

$ awk '{sum+=$2} END {print "sum: " sum " avg: " sum/NR}' data.txt

Conclusion

I hope this short introduction helped you add awk to your toolbox if you weren’t familiar with awk yet. Awk is a neat alternative to full-blown scripting languages like Python and Perl for simple text processing tasks.

One thought on “My favorite Unix tool

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s