26 July 2021

Extracting specific data from Apache access logs is tricky. The files don’t use a single delimiter, so you can’t simply, say, open an access log in a spreadsheet and filter specific columns. However, with a bit of Awk magic you can report on specific fields.

The delimiter problem

To work with a raw access log you ideally want the file in a format you can import into a spreadsheet. Each field would be in its own column, so that you can easily sort and filter specific columns. If you are a spreadsheet wizard then you could even link multiple logs together and use vlookups and other weird and wonderful spreadsheet functions.

Unfortunately, raw access logs don’t use such a format. If you open an access log in a spreadsheet application then you quickly find there is a problem: which delimiters do you use to get all the fields in separate columns? You can actually get pretty far by using spaces as the delimiter and double quotes as the string delimiter. For instance, here is how one of the example log entries I used in the article about understanding access logs is imported:

12.34.56.78	-	-	[20/Jul/2021:12:09:30	+0100]	GET /wp-login.php HTTP/1.1	404	11138	-	Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0

That is not too bad, but there are a few issues. The timestamp got split into two separate columns, and there are unwanted square brackets (how are you going to sort entries by date?). Also, the sixth column (“GET /wp-login.php HTTP/1.1”) contains three fields: the HTTP method, the resource and the protocol. So, to get all 11 fields in their own columns and in a usable format you need to do a bit more work.

Ultimately, the issue here is that a spreadsheet isn’t the best tool for the job. That is not to say that this approach is wrong. It is fine to work with the tools you are familiar with, even if they aren’t the most suitable. And that is all the more true if the best tool for the job is the Awk programming language.

Awk basics

Awk works a little like the “text to columns” function in spreadsheet applications. By default, it splits each line of a file into fields on whitespace, and each field is assigned a variable. The first field is $1, the second $2 and so forth. It also has many built-in functions for working with data, but let’s first look at a basic example: getting the top IP addresses in an access log.
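
For a quick sanity check, you can feed Awk the example log line from earlier (with the double quotes that the spreadsheet import stripped put back in) and print a few of those field variables, say the IP address, status code and requested resource:

# echo '12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /wp-login.php HTTP/1.1" 404 11138 "-" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"' \
| awk '{print $1, $9, $7}'
12.34.56.78 404 /wp-login.php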

The IP address is the first field, so you can use Awk to grab the first field on every line and then use other utilities to create a sorted count:

# awk '{print $1}' example.com_ssl-log \
| sort | uniq -c | sort -nr | head
 842  11.22.33.44
  15  99.88.77.66
   3  55.66.77.88
...
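
A small variation on the same idea gives you the most requested resources instead, assuming the resource sits in the seventh whitespace-separated field as in the example line above (the output has the same count-plus-value format as before):

# awk '{print $7}' example.com_ssl-log \
| sort | uniq -c | sort -nr | head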

To get the top IP addresses for a specific time period you can usually just grep the lines you are interested in first and then perform the same Awk magic. For instance, here I produce a count of the top IP addresses between 13:00 and 13:30 on 25 July 2021:

# grep "25/Jul/2021:13:[0-2]" example.com_ssl-log \
| awk '{print $1}' \
| sort | uniq -c | sort -nr | head
 345  11.22.33.44
  12  99.88.77.66
   3  55.66.77.88
...
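
If you prefer to keep everything in Awk, you can match the timestamp field directly rather than reaching for grep first. This sketch should give the same counts as the grep version, and it only looks at the fourth field rather than the whole line:

# awk '$4 ~ /25\/Jul\/2021:13:[0-2]/ {print $1}' example.com_ssl-log \
| sort | uniq -c | sort -nr | head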

And you can even do things like this:

# awk '$9 ~ /5[0-9]{2}/ {print $9, $4, $5, $6, $7}' example.com_ssl-log
503 [25/Jul/2021:13:38:17 +0100] "GET /examples-rock
503 [25/Jul/2021:13:38:21 +0100] "GET /sites/all/themes/example/css/small.css?qsfb62
503 [25/Jul/2021:13:39:43 +0100] "GET /
503 [25/Jul/2021:13:39:50 +0100] "GET /what-is-an-example

Here, Awk checked if the ninth field matches the pattern 5[0-9]{2} (a five followed by two other digits). For any matching lines it prints the status code ($9), timestamp ($4 and $5) and the HTTP method and resource ($6 and $7). In other words, it gives me specific details about any server errors showing in the log.
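
The same counting trick from before also works on the status code field, if you first want a rough overview of how the requests ended before drilling into the errors (again assuming the status code is the ninth whitespace-separated field):

# awk '{print $9}' example.com_ssl-log \
| sort | uniq -c | sort -nr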

Advanced Awk

The last example works but it does push basic Awk usage to its limits:

  • To print the timestamp you need to print two fields ($4 and $5), as there is a space between the time and the timezone.
  • We are also still stuck with the odd Apache timestamp format, including the square brackets. That makes it difficult to do more advanced filtering on the timestamps.
  • The HTTP method still includes the opening double quote. Awk splits the fields by spaces, so the quote is part of the string, and to get rid of it you need to do more processing (see the sketch after this list).
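
As a minimal sketch of that extra processing, you can strip the brackets and the stray quote with gsub() before printing, building on the error-listing example above:

# awk '$9 ~ /5[0-9]{2}/ { gsub(/\[/, "", $4); gsub(/\]/, "", $5); gsub(/"/, "", $6); print $9, $4, $5, $6, $7 }' example.com_ssl-log
503 25/Jul/2021:13:38:17 +0100 GET /examples-rock
503 25/Jul/2021:13:38:21 +0100 GET /sites/all/themes/example/css/small.css?qsfb62
...

It works, but sprinkling gsub() calls over every one-liner gets old quickly.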

The log muncher

In short, we have hit the same limit we ran into when we looked at using the “text to columns” function in spreadsheets. To properly delimit the fields you need to perform more wizardry. There are many ways to go about this, but I like to use this log muncher function. It creates a tab-separated file named access_log.tmp with the 11 Apache access log fields and one extra field: the Apache timestamp converted to a Unix timestamp.
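
To give an idea of what such a conversion involves, here is a minimal sketch of a muncher (not my actual function) using gawk’s mktime(). It assumes every line follows the combined log format shown earlier, puts the Unix timestamp in the fifth column to match the examples below, and ignores the timezone offset in the log, so it only gives correct results if the machine munching the log is in the same timezone as the server that wrote it:

# gawk 'BEGIN { OFS = "\t"; months = "JanFebMarAprMayJunJulAugSepOctNovDec" }
{
    split($0, q, "\"")                    # q[2] = request, q[4] = referer, q[6] = user agent
    split(q[1], head, " ")                # IP, identd, user, [date, zone]
    split(q[3], tail, " ")                # status, size
    split(q[2], req, " ")                 # method, resource, protocol
    ts = head[4];   gsub(/\[/, "", ts)    # drop the leading square bracket
    zone = head[5]; gsub(/\]/, "", zone)  # drop the trailing square bracket
    split(ts, d, "[/:]")                  # day, month name, year, hour, minute, second
    month = (index(months, d[2]) + 2) / 3
    unix = mktime(d[3] " " month " " d[1] " " d[4] " " d[5] " " d[6])
    print head[1], head[2], head[3], ts " " zone, unix, req[1], req[2], req[3], tail[1], tail[2], q[4], q[6]
}' example.com_ssl-log > access_log.tmp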

The Unix timestamp shows the time as the number of seconds since 1 January 1970 (the “Unix epoch”). This makes it easy to filter all entries between two timestamps. For instance, you can list specific fields for all log entries in the last hour. To do so you first need the current time and the current time minus 3,600 seconds:

# date +%s
1627225164

# echo "$(( $(date +%s) - 3600 ))"
1627221569
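
As an aside, GNU date can also translate such a number back into something readable. Here I format it like the Apache timestamps; the output assumes a machine in the same +0100 timezone as the logs:

# date -d @1627225164 +'%d/%b/%Y:%H:%M:%S %z'
25/Jul/2021:15:59:24 +0100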

Next, you tell Awk to use tabs as the separator (-F'\t') and check the two Unix timestamps against the fifth field in the file. You can then print whatever fields you are interested in. Here, I print the IP address, the date and the resource that was requested:

# awk -F'\t' '$5 > 1627221569 && $5 < 1627225164 {print $1, $4, $7}' access_log.tmp
99.88.77.66 25/Jul/2021:14:59:43 +0100 /xmlrpc.php
55.44.33.22 25/Jul/2021:15:01:38 +0100 /wp-login.php
...
11.22.33.44 25/Jul/2021:15:37:16 +0100 /wp-login.php
11.22.33.44 25/Jul/2021:15:37:17 +0100 /xmlrpc.php
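
Rather than copying the numbers around by hand, you can also let the shell compute the cut-offs and pass them to Awk as variables. This is just a sketch of the same filter as above with the hard-coded timestamps replaced:

# awk -F'\t' -v start="$(( $(date +%s) - 3600 ))" -v end="$(date +%s)" \
'$5 > start && $5 < end {print $1, $4, $7}' access_log.tmp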

Managing separators

The above output is separated by spaces. You can use a BEGIN rule to set both the input and the output separator. This does make the rest of the command a little less intuitive: I have also moved the tests into an explicit if clause, so the condition and the print statement now sit inside the action’s curly brackets. To make the example more interesting I added another test: the IP address needs to match 11.22.33.44.

# awk 'BEGIN{FS="\t";OFS="\t"} { if ($1 == "11.22.33.44" && $5 > 1627221569 && $5 < 1627225164) {print $1, $4, $7} }' access_log.tmp
11.22.33.44   25/Jul/2021:15:37:15 +0100      /wp-login.php
11.22.33.44   25/Jul/2021:15:37:16 +0100      /wp-login.php
11.22.33.44   25/Jul/2021:15:37:17 +0100      /xmlrpc.php
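
For what it is worth, the same filter can also be written without the explicit if clause by keeping the tests as a pattern in front of the action; which form you prefer is largely a matter of taste:

# awk 'BEGIN{FS="\t";OFS="\t"} $1 == "11.22.33.44" && $5 > 1627221569 && $5 < 1627225164 {print $1, $4, $7}' access_log.tmp

The output is the same as above.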

I hope this demonstrates how Awk can munch data in any way you like. It is not an easy tool to use, but once you understand the basics it can be very powerful. And of course, you don’t have to do all the processing using Awk. You can safely import the file the log muncher function produces into your favourite spreadsheet application!