26 July 2021

Apache access logs contain a lot of information about requests made to websites. This article looks at the information you find in the logs. In the next article I will look at how to get specific information from logs.

Log format

The log format used by Apache can be customised, but very few administrators do so. When you look at access log entries for your domain you will almost certainly find that they look something like this:

12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /wp-login.php HTTP/1.1" 404 11138 "-" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"

Even if you have never looked at an access log before, you probably understand most of the fields. The line starts with the IP address that made the request, and there is a timestamp between square brackets. After that you get the resource that was requested (“GET /wp-login.php”). You might also notice that Apache returned the status code 404 (“not found”). Finally, the last field shows that the request came from someone running Firefox 89 on Fedora.

Other fields are less obvious. For instance, the above example has three fields that are empty (“-”) and there is the number 11138 after the status code. Here is an overview of all the bits of information.

12.34.56.78

The IP address of the client (“remote host”).

-

The identity of the client. This field is rarely used and almost always shows a hyphen.

-

The user ID of the person who made the request. This shows a username if you use password-protected directories (i.e. when someone needs to log in to access a resource). The field shows a hyphen if there was no login.

[20/Jul/2021:12:09:30 +0100]

The timestamp for the request, written as day/month/year:hour:minute:second followed by the offset from UTC. The format is a little awkward, as it is not so easy to sort or compare in scripts. A timestamp in the format YYYY-MM-DD HH:MM:SS would be much easier.
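If you do want to process the timestamps in a script, they can be converted to a friendlier format. The following is a minimal sketch in Python, not the only way to do it. The function name parse_timestamp is my own, and the sketch assumes the script runs in a locale with English month abbreviations (such as “Jul”).

```python
from datetime import datetime

def parse_timestamp(raw):
    """Convert the text between the square brackets into a datetime.

    Assumes English month abbreviations, as %b depends on the locale.
    """
    return datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")

stamp = parse_timestamp("20/Jul/2021:12:09:30 +0100")

# Print the timestamp in the YYYY-MM-DD HH:MM:SS format.
print(stamp.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-07-20 12:09:30
```

The resulting datetime object keeps the +0100 offset, so entries from servers in different time zones can still be compared.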

“GET /wp-login.php HTTP/1.1”

This field contains three bits of information: the HTTP method (such as GET or POST), the resource that was requested (/wp-login.php) and the protocol used (HTTP/1.1).
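Splitting the request field into its three parts is straightforward in a script. Here is a small sketch in Python. The function name split_request is hypothetical, and the check for exactly three parts is my own precaution, as malformed requests do not always follow the pattern.

```python
def split_request(request):
    """Split a request field into (method, resource, protocol).

    Returns None if the field does not have exactly three parts.
    """
    parts = request.split()
    if len(parts) != 3:
        return None
    method, resource, protocol = parts
    return method, resource, protocol

print(split_request("GET /wp-login.php HTTP/1.1"))
# ('GET', '/wp-login.php', 'HTTP/1.1')
```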

404

The status code returned to the client by Apache. In the example entry the status code is 404, which means that the wp-login.php page doesn’t exist.

11138

The number of bytes sent back to the client. This is the size of the response, not including the HTTP headers.

“-”

The referrer for the request. This is typically the page that linked to the resource. In this case I pointed my browser straight at example.com/wp-login.php and the field is therefore empty.

“Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0”

The user agent request header. This field contains quite a bit of information. It shows that the request came from someone running Firefox 89 on Fedora Linux, and also reveals that this is a 64-bit operating system running the X11 window system.
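To illustrate how the fields fit together, here is a rough sketch of a Python script that breaks the example entry into its nine fields. The pattern and the field names (host, ident, user and so on) are my own, and the sketch assumes the default log format shown in this article. It does not handle entries where a quoted field itself contains an escaped double quote.

```python
import re

# Pattern for the log format used in the example entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of field names to values, or None if there is no match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

entry = parse_line(
    '12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] '
    '"GET /wp-login.php HTTP/1.1" 404 11138 "-" '
    '"Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) '
    'Gecko/20100101 Firefox/89.0"'
)

# All values are strings, including the status code and size.
print(entry["host"], entry["status"], entry["size"])  # 12.34.56.78 404 11138
```

Named groups keep the script readable: entry["referrer"] says more than a numbered column would.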

Confusing fields

There are three fields that might need a bit more explanation. The first is the user ID field (the third field). If you add password protection to a directory, your browser prompts you for a username and password when you try to access the protected area. If the login is successful, the username is shown in the third field.

The field doesn’t store other login names, such as the username of a WordPress user. However, it is used in cPanel’s access log. When you log in to cPanel (or webmail), the third field shows the login name.

Referrers

Another field worth highlighting is the referrer field. When I tried to access example.com/wp-login.php I got a 404 error, as the page doesn’t exist. That was not the only request – there were a dozen or so related ones:

12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /wp-login.php HTTP/1.1" 404 11138 "-" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /modules/system/system.base.css?qsfb62 HTTP/1.1" 200 5428 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /modules/system/system.messages.css?qsfb62 HTTP/1.1" 200 961 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /modules/user/user.css?qsfb62 HTTP/1.1" 200 1827 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /modules/system/system.menus.css?qsfb62 HTTP/1.1" 200 2035 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
12.34.56.78 - - [20/Jul/2021:12:09:30 +0100] "GET /modules/field/theme/field.css?qsfb62 HTTP/1.1" 200 550 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0"
...

When I accessed wp-login.php I got a custom “page not found” page. That page downloads lots of other resources, such as style sheets. All those requests have example.com/wp-login.php as the referrer, as that is the resource that triggered the requests.

The referrer can also be an external site, such as a search engine or a website that has linked to the page that was requested. So, you can use the field to see which websites have linked to your website. You will probably also quickly learn about referrer spam. That is, other websites might send requests with their own address as the referrer purely to get listed in analytics tools such as AWStats. If the stats for your website are publicly available then the advertised website is effectively linked from your website, which may (or may not) improve the search engine ranking of that site.
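As a rough illustration, the following Python sketch counts how often each external host appears as a referrer. The function name external_referrers is my own, and the sample list is made up for the example (www.example.org does not appear in my logs). Empty referrers (“-”) and referrers on your own domain are skipped.

```python
from collections import Counter
from urllib.parse import urlsplit

def external_referrers(referrers, own_host):
    """Count referrer hosts, ignoring empty ones and our own domain."""
    counts = Counter()
    for ref in referrers:
        host = urlsplit(ref).hostname  # None for "-" and other non-URLs
        if host and host != own_host:
            counts[host] += 1
    return counts

# Hypothetical referrer values, for illustration only.
sample = [
    "-",
    "http://example.com/wp-login.php",
    "https://www.example.org/links.html",
    "https://www.example.org/blog/",
]

print(external_referrers(sample, "example.com"))
# Counter({'www.example.org': 2})
```

A host that shows up many times without any matching real visits is a likely candidate for referrer spam.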

User agents

Finally, it is worth bearing in mind that user agents can be spoofed. Apache records how a client identifies itself, but it doesn’t verify the information. So, I can tell Apache that I use ClownBrowser 0.1.0.1.0.1 on Windows 13:

12.34.56.78 - - [20/Jul/2021:13:26:07 +0100] "GET / HTTP/1.1" 200 8376 "-" "Hosanna/1.1 (Window 13; x128) ClownBrowser 0.1.0.1.0.1"
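
Setting a made-up user agent takes only a few lines. This is a sketch in Python using the standard urllib module, which is not necessarily the tool I used for the request above. The request is only built here, not sent; the commented-out lines show how it could be sent.

```python
import urllib.request

SPOOFED_AGENT = "Hosanna/1.1 (Window 13; x128) ClownBrowser 0.1.0.1.0.1"

# Build a request that identifies itself with the made-up user agent.
request = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": SPOOFED_AGENT},
)

# urllib stores header names capitalised as "User-agent".
print(request.get_header("User-agent"))

# Uncomment to actually send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```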