awk is a programming language that can easily fill a book. In fact, quite a few books have been written about awk. This article is a gentle introduction to the language. It doesn't cover advanced features such as loops, but we will show you how to use awk to filter data in log files.
One of the main reasons why awk is often used for manipulating structured text files (such as logs) is that it automatically assigns a variable to each data field on a line. The first field is represented by the variable $1, the second by $2, and so forth. There is also a special variable for the last field on a line: $NF.
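To see the field variables in action, here is a minimal sketch using an invented sample line:

```shell
# awk assigns $1, $2, ... to whitespace-delimited fields; $NF is the last one.
echo "one two three four" | awk '{ print $1, $NF }'
# → one four
```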
By default, awk assumes that data fields are delimited by spaces. However, you can use any other character (or a string of characters) as the field separator. Before we get to that, let's look at an example.
The below is a snippet from an Apache access log file. We are using grep to filter lines containing the string POST /wp-login.php, and we are only printing the first three lines (by piping the output to head -3):
$ grep "POST /wp-login.php" example.net | head -3
192.99.147.77 - - [08/Sep/2020:13:14:17 +0100] "POST /wp-login.php HTTP/1.1" 200 1552 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
192.99.147.77 - - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1529 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
192.99.147.77 - - [08/Sep/2020:13:14:18 +0100] "POST /wp-login.php HTTP/1.1" 200 1523 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
The IP address is the first field and is therefore represented by $1. To print only the first data field you can use awk '{ print $1 }':
$ grep "POST /wp-login.php" example.net \
  | awk '{ print $1 }' | head -3
192.99.147.77
192.99.147.77
192.99.147.77
Often, a list of IP addresses is piped to sort and uniq to create a count of the number of occurrences:
$ grep "POST /wp-login.php" example.net \
  | awk '{ print $1 }' \
  | sort | uniq -c | sort -nr | head -3
     12 82.199.94.105
      8 52.51.225.142
      8 5.135.105.44
The result is a sorted list of IP addresses that made POST requests to the wp-login.php file.
As mentioned, awk assumes by default that data fields are delimited by spaces. The -F option lets you change the delimiter. For instance, you can use a comma as the separator by specifying -F','.
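For example, here is a minimal sketch with comma-separated data (the values are made up for illustration):

```shell
# -F',' tells awk to split each line on commas instead of whitespace.
echo "alice,admin,active" | awk -F',' '{ print $2 }'
# → admin
```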
Apache access logs don't really use delimiters, or at least not in a consistent way. Some fields are separated by spaces, while the timestamp is delimited by square brackets. Other information, such as the user agent, is wrapped in quotes. However, by changing the delimiter we can extract whatever information we want to filter.
In Apache access logs the HTTP method (such as GET or POST) is printed after a double-quote character. This is followed by a space and the resource that was requested. We can therefore get a list of the URLs that were accessed by combining two awk commands:
$ awk -F'\"' '{ print $2 }' example.net \
  | awk '{ print $2 }' \
  | sort | uniq -c | sort -nr | head -5
   1460 /wp-login.php
    259 /xmlrpc.php
     31 /
     27 /favicon.ico
     24 /apple-touch-icon-precomposed.png
The main thing to note in the above command is the escaped double quote in the -F option (-F'\"'). Double quotes have a special meaning in awk, and the backslash ensures the character is treated as a literal field separator. (The surrounding single quotes already protect it from the shell, and most awk implementations accept the unescaped -F'"' as well, but the escaped form avoids any ambiguity.)
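To see the effect of using a double quote as the separator in isolation, here is a minimal sketch (the sample line is invented): everything before the first double quote becomes $1, the quoted text becomes $2, and so on.

```shell
# With '"' as the separator, the text between the first pair of quotes is $2.
echo 'GET "/index.html" HTTP/1.1' | awk -F'\"' '{ print $2 }'
# → /index.html
```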
You can extract data from other files in the same way. The below is an extract from a global error_log file. We are grepping for the (case-insensitive) string wordpressbruteforce. The command filters connections that were blocked by LiteSpeed:
$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
  | head -3
2020-09-01 00:05:15.153680 [NOTICE] [10835] [192.99.15.139] bot detected for vhost [APVH_example.net], reason: WordPressBruteForce, close connection!
2020-09-01 00:14:18.391639 [NOTICE] [10835] [195.154.102.244] bot detected for vhost [APVH_example.net], reason: WordPressBruteForce, close connection!
2020-09-01 00:16:30.724135 [NOTICE] [10835] [192.99.15.139] bot detected for vhost [APVH_evilcorp.biz], reason: WordPressBruteForce, close connection!
There are two ways to get a list of IP addresses from the file. The first option is to use awk to print the fifth field ($5). That would give us a list of IP addresses, but we would need to strip the opening and closing brackets. That job can be taken care of by the tr command:
$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
  | awk '{ print $5 }' \
  | tr -d '[]' \
  | sort | uniq -c | sort -nr | head -5
    351 192.99.15.139
    112 192.99.35.149
     88 46.105.127.166
     77 46.105.99.212
     65 46.105.99.163
The second option uses two awk commands: one that uses an opening square bracket as the delimiter, and a second that uses a closing square bracket:
$ grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
  | awk -F '[' '{ print $4 }' \
  | awk -F']' '{ print $1 }' \
  | sort | uniq -c | sort -nr | head -5
    351 192.99.15.139
    112 192.99.35.149
     88 46.105.127.166
     77 46.105.99.212
     65 46.105.99.163
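A third option is worth knowing about: when the value given to -F is longer than one character, awk treats it as a regular expression. The bracket expression [][] matches either square bracket, so both awk commands can be combined into one. Assuming the error_log format shown above, the IP address then ends up in the sixth field:

```shell
# '[][]' is a regular expression matching '[' or ']', so the line is split
# on both brackets at once; with this log format, field 6 is the IP address.
grep -i wordpressbruteforce /usr/local/apache/logs/error_log \
  | awk -F'[][]' '{ print $6 }' \
  | sort | uniq -c | sort -nr | head -5
```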
The main point to take away from this introduction to awk is that it is a useful tool for extracting fields from structured files, such as logs. And as we have seen, it is often used in combination with other utilities such as grep, sort, and uniq.
If this article has sparked your interest, there is a lot more to awk than we have covered here. To learn more, we recommend Bruce Barnett's awk tutorial and Tutorials Point's awk guide.