Last updated: 29 April 2021

awk is a text-processing language with enough depth to fill a book, and quite a few books have been written about it. This article is a gentle introduction: it doesn’t cover more advanced features such as loops, but it does show how to use awk to filter data in log files.

Data element variables

One of the main reasons why awk is often used for manipulating structured text files (such as logs) is that it automatically splits every line into fields and assigns each field to a variable. The first field is represented by the variable $1, the second by $2, and so forth. There is also a special variable for the last field on a line: $NF.
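As a minimal sketch of how these variables behave, the snippet below feeds a single made-up log line to awk and prints its first and last field. The line and the address 192.0.2.10 (from a documentation-only range) are assumptions for illustration, not data from a real log.

```shell
# Hypothetical access-log line; the IP address is a documentation-only example.
line='192.0.2.10 - - [08/Sep/2020:13:14:17 +0100] "POST /wp-login.php HTTP/1.1" 200 1552'

# $1 is the first whitespace-separated field, $NF the last one.
first_and_last=$(printf '%s\n' "$line" | awk '{ print $1, $NF }')

printf '%s\n' "$first_and_last"
```

The comma in the print statement tells awk to put a single space between the two values.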

By default awk assumes that data fields are delimited by whitespace (one or more spaces or tabs). However, you can use any other character (or a string of characters) as the field separator. Before we get to that, let’s look at an example.

Getting IP addresses from a log file

Below is a snippet from an Apache access log file. We use grep to filter lines containing the string POST /wp-login.php and print only the first three matches (by piping the output to head -3):

$ grep "POST /wp-login.php" | head -3
- - [08/Sep/2020:13:14:17 +0100] "POST /wp-login.php HTTP/1.1" 200 1552 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
- - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1529 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
- - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1523 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"

The IP address is the first field, and is therefore represented by $1. To print only the first data field you can use awk '{ print $1 }':

$ grep "POST /wp-login.php" \
| awk '{ print $1 }' | head -3

Often, a list of IP addresses is piped to sort and uniq -c to count how often each address occurs. The list has to be sorted first, because uniq only collapses duplicates that are adjacent:

$ grep "POST /wp-login.php" \
| awk '{ print $1 }' \
| sort | uniq -c | sort -nr | head -3

The result is a list of the three IP addresses that made the most POST requests to the wp-login.php file, each preceded by its request count.
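To make the counting pipeline concrete, here is a small self-contained sketch that runs it against four made-up log lines instead of a real access log. The sample lines and the IP addresses (all from documentation-only ranges) are assumptions; the final awk call merely swaps the two columns that uniq -c produces so the address comes first.

```shell
# Hypothetical log lines; the addresses are documentation-only examples.
sample_log='192.0.2.10 - - [08/Sep/2020:13:14:17 +0100] "POST /wp-login.php HTTP/1.1" 200 1552
198.51.100.7 - - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1529
192.0.2.10 - - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1523
203.0.113.5 - - [08/Sep/2020:13:14:19 +0100] "GET / HTTP/1.1" 200 512'

# Keep the POST requests, print the client address, count the occurrences,
# and keep only the busiest address as "address count".
top_ip=$(printf '%s\n' "$sample_log" \
  | grep "POST /wp-login.php" \
  | awk '{ print $1 }' \
  | sort | uniq -c | sort -nr | head -1 \
  | awk '{ print $2, $1 }')

printf '%s\n' "$top_ip"
```

The GET request from 203.0.113.5 is dropped by grep, so only two distinct addresses are counted.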

The delimiter

As mentioned, awk splits fields on whitespace by default. The -F option lets you change the delimiter. For instance, you can use a comma as the separator by using -F','.
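A quick way to see the -F option at work is to split a comma-separated record. The record below is a made-up example (date, client address, status code), not taken from any real file.

```shell
# Hypothetical comma-separated record: date, client address, status code.
record='2020-09-08,192.0.2.10,200'

# With -F',' awk splits on commas, so $2 is the client address.
second_field=$(printf '%s\n' "$record" | awk -F',' '{ print $2 }')

printf '%s\n' "$second_field"
```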

Apache access logs don’t really use delimiters, or at least not in a consistent way. Some fields are separated by spaces while the timestamp is delimited by square brackets. Other information, such as the user agent, is wrapped in quotes. However, by changing the delimiter we can get whatever information we want to filter.

Delimiting by quotes

In Apache access logs the HTTP method (such as GET or POST) is printed directly after a double-quote character. It is followed by a space and the resource that was requested. We can therefore get a list of the URLs that were accessed by combining two awk commands:

  • The first command uses double quotes as the delimiter and prints the second field.
  • The second command uses the default delimiter (a space) and also prints the second field.

$ awk -F'\"' '{ print $2 }' \
| awk '{ print $2 }' \
| sort | uniq -c | sort -nr | head -5
   1460 /wp-login.php
    259 /xmlrpc.php
     31 /
     27 /favicon.ico
     24 /apple-touch-icon-precomposed.png

The main thing to note in the above command is how the delimiter is quoted. The double quote is wrapped in single quotes so that the shell passes it to awk untouched; an unquoted double quote would make the shell wait for a closing quote. The backslash in -F'\"' is optional: awk processes escape sequences in the -F argument, so -F'\"' and -F'"' both set the delimiter to a double-quote character.
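The two-step extraction can be tried on a single line. The sketch below uses a made-up access-log line (the IP address and the shortened user agent are assumptions) and stores the intermediate result so that both steps are visible.

```shell
# Hypothetical access-log line; address and user agent are placeholders.
line='192.0.2.10 - - [08/Sep/2020:13:14:17 +0100] "POST /wp-login.php HTTP/1.1" 200 1552 "-" "Mozilla/5.0"'

# Step 1: split on double quotes; the second field is the request line.
request=$(printf '%s\n' "$line" | awk -F'\"' '{ print $2 }')

# Step 2: split the request line on whitespace; the second field is the path.
path=$(printf '%s\n' "$request" | awk '{ print $2 }')

printf '%s\n' "$request" "$path"
```

After the first step the value is the full request line (method, path and protocol); the second step isolates the path.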

Delimiting by square brackets

You can extract data from other files in the same way. Below is an extract from a global error_log file. We grep for the string wordpressbruteforce (ignoring case via the -i option), which matches connections that were blocked by LiteSpeed:

$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
| head -3
2020-09-01 00:05:15.153680 [NOTICE] [10835] [] bot detected for vhost [], reason: WordPressBruteForce, close connection!
2020-09-01 00:14:18.391639 [NOTICE] [10835] [] bot detected for vhost [], reason: WordPressBruteForce, close connection!
2020-09-01 00:16:30.724135 [NOTICE] [10835] [] bot detected for vhost [], reason: WordPressBruteForce, close connection!

There are two ways to get a list of IP addresses from the file. The first option is to use awk to print the fifth field ($5). That gives us a list of IP addresses, but each address is still wrapped in square brackets. The tr command can strip the opening and closing brackets:

$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
| awk '{ print $5 }' \
| tr -d '[]' \
| sort | uniq -c | sort -nr | head -5

The second option uses two awk commands: one that uses an opening square bracket as the delimiter and a second that uses a closing square bracket:

$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
| awk -F'[' '{ print $4 }' \
| awk -F']' '{ print $1 }' \
| sort | uniq -c | sort -nr | head -5
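As an alternative sketch (not one of the two options above), the two awk commands can be folded into one by passing a bracket expression as the delimiter: -F'[][]' tells awk to split on either an opening or a closing square bracket. The sample line below is an assumption modelled on the error_log extract, with a made-up IP address and vhost filled in. This works with the awk implementations we are aware of, but it is worth testing on your own system.

```shell
# Hypothetical error_log line; the IP address and vhost are placeholders.
line='2020-09-01 00:05:15.153680 [NOTICE] [10835] [192.0.2.10] bot detected for vhost [example.com], reason: WordPressBruteForce, close connection!'

# Splitting on "[" and "]" gives: timestamp, NOTICE, " ", 10835, " ", IP, ...
# so the IP address ends up in the sixth field.
ip=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $6 }')

printf '%s\n' "$ip"
```

Because every bracket now starts a new field, the field number is higher than in the two-command version, where the IP address was field four after splitting on the opening bracket only.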

More information

The main point to take away from this introduction to awk is that it is a useful tool to extract fields from structured files, such as logs. And as we have seen it is often used in combination with other utilities, such as grep, sort and uniq.

If this article has sparked your interest, there is a lot more to awk than we covered here. To learn more we recommend Bruce Barnett’s awk tutorial and Tutorials Point’s awk guide.