Regular Expressions

The basic principles behind regex

Regular expressions (regex) are a language to tell a program how to look for certain patterns in strings (text data).

Writing Regular Expressions is hard!

I almost always have to look up how to specify a regular expression.

And then I still almost always get it wrong…

Tip

Here are a few ideas to help you:

Check whether your regular expression works using this interactive online tool: https://regexr.com/
Consult this helpful cheatsheet: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf
Ask ChatGPT to write the regex for you (regex syntax is largely language-agnostic, but it is still a good idea to mention R in the prompt)

The building blocks of regular expressions

Character classes

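These allow you to match specific sets of characters. For example, “[a-z]” matches any lowercase letter and “[A-Za-z]” matches both lowercase and uppercase letters. Avoid “[A-z]”: that range also covers the symbols that sit between “Z” and “a” in ASCII, such as “[”, “^” and “_”.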

library(stringr)

# Match any lowercase letter (each one is highlighted separately)
pattern <- "[a-z]"
str <- "Hello World"
str_view(str, pattern, html = TRUE)
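
To see why the exact range matters, here is a minimal sketch, assuming {stringr} is installed. The vector `chars` is made up for illustration; it holds two letters plus three symbols that fall between "Z" and "a" in ASCII.

```r
library(stringr)

# Illustrative test characters (not from the original slides)
chars <- c("Q", "q", "_", "^", "[")

# The range A-z spans the symbols between "Z" and "a" as well
wide_range <- str_detect(chars, "[A-z]")

# Two explicit ranges match letters only
letters_only <- str_detect(chars, "[A-Za-z]")

print(wide_range)    # TRUE for all five characters
print(letters_only)  # TRUE only for "Q" and "q"
```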

Character classes

You can define your own classes using the [] notation. But there are also quite a few pre-defined character classes to choose from, including:

R regex	What matches
\w	Any word character (any letter, digit, or underscore)
\W	Any non-word character
\d	Any digit
\D	Any non-digit
\s	Any space character (a space, a tab, a new line, etc.)
\S	Any non-space character
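
The shorthand classes in the table can be tried out like this. It is a small sketch, assuming {stringr} is installed; the string `txt` is an invented example. Remember that every backslash has to be doubled inside an R string.

```r
library(stringr)

txt <- "Room 42, floor 3"  # illustrative input

digits <- str_extract_all(txt, "\\d+")[[1]]  # runs of digits
words  <- str_extract_all(txt, "\\w+")[[1]]  # runs of word characters
spaces <- str_count(txt, "\\s")              # number of whitespace characters
only_d <- str_remove_all(txt, "\\D")         # drop everything that is not a digit

print(digits)  # "42" "3"
print(words)   # "Room" "42" "floor" "3"
print(spaces)  # 3
print(only_d)  # "423"
```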

Character classes

To make things more confusing, there are also other ways to specify classes. These POSIX-style classes must themselves be placed inside square brackets, e.g. “[[:alpha:]]”.

R regex	What matches
[:alpha:]	Any letter
[:lower:]	Any lowercase letter
[:upper:]	Any uppercase letter
[:digit:]	Any digit (equivalent to \d)
[:alnum:]	Any letter or number
[:word:]	Any letter or number, as well as underscores
[:xdigit:]	Any hexadecimal digit
[:punct:]	Any punctuation character
[:graph:]	Any letter, number, or punctuation character
[:space:]	A space, a tab, a new line, etc. (equivalent to \s)

Example

# Match any letter, digit, or underscore
pattern <- "[[:word:]]"
str <- c("Hello World &!&!&1234")
str_view(str, pattern, html = TRUE)

# Match any space character
# Note that you need two "\" characters
# otherwise you'll get an error: "unrecognized escape in character string"
pattern <- "\\s"
str <- "Hello World"
str_view(str, pattern, html = TRUE)
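
A typical use of the bracketed classes is cleaning or counting characters. The following is a minimal sketch, assuming {stringr} is installed; `msg` is an invented example string.

```r
library(stringr)

msg <- "Hello, World!"  # illustrative input

# Remove all punctuation characters
no_punct <- str_remove_all(msg, "[[:punct:]]")

# Pull out the uppercase letters
upper <- str_extract_all(msg, "[[:upper:]]")[[1]]

# Count the letters (punctuation and spaces are ignored)
n_letters <- str_count(msg, "[[:alpha:]]")

print(no_punct)   # "Hello World"
print(upper)      # "H" "W"
print(n_letters)  # 10
```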

Metacharacters

These are characters with a special meaning in regular expressions, including “.”, “-”, “\”, “^”, “[]”, “()”, and “$”.

# Match any character
pattern <- "."
str <- "Hello World"
str_view(str, pattern, html = TRUE)

# Match "H" only at the start of the string
pattern <- "^H"
str <- "Hello Hello"
str_view(str, pattern, html = TRUE)

# Match "d" only at the end of the string
pattern <- "d$"
str <- "World World"
str_view(str, pattern, html = TRUE)

Anchors

These allow you to specify where a match should occur within a string, e.g. at the start or end of the string.

# Match the start of a string
pattern <- "^Hello"
str <- c("Hello World", "World Hello")
str_view(str, pattern, html = TRUE, match = NA)

# Match the end of a string
pattern <- "Hello$"
str_view(str, pattern, html = TRUE, match = NA)
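
Anchors are often combined with str_detect() to filter strings. This is a small sketch, assuming {stringr} is installed; the vector `x` is illustrative. Using both anchors at once requires the whole string to match.

```r
library(stringr)

x <- c("Hello World", "World Hello", "Hello")  # illustrative input

starts <- str_detect(x, "^Hello")   # "Hello" at the start
ends   <- str_detect(x, "Hello$")   # "Hello" at the end
exact  <- str_detect(x, "^Hello$")  # the string is exactly "Hello"

print(starts)  # TRUE FALSE TRUE
print(ends)    # FALSE TRUE TRUE
print(exact)   # FALSE FALSE TRUE
```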

Quantifiers

These specify how many times a pattern should repeat. Examples include:

“*” (zero or more occurrences)
“+” (one or more occurrences)
“?” (zero or one occurrence)

# Match zero or more occurrences of "o" following a Y
pattern <- "Yo*"
str <- c("Yo World", "Yoooooo Wlrd", "Y world")
str_view(str, pattern, html = TRUE, match = NA)

# Match one or more occurrences of "o" following a Y
pattern <- "Yo+"
str <- c("Yo World", "Yoooooo Wlrd", "Y world")
str_view(str, pattern, html = TRUE, match = NA)

# Match zero or one occurrence of "u" 
pattern <- "colou?r"
str <- c("colour", "color", "colouur")
str_view(str, pattern, html = TRUE, match = NA)
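
To see exactly what each quantifier captures, the matches can be extracted rather than highlighted. A minimal sketch, assuming {stringr} is installed; the variable names are illustrative.

```r
library(stringr)

greetings <- c("Yo World", "Yoooooo Wlrd", "Y world")

star <- str_extract(greetings, "Yo*")  # "o" may be absent
plus <- str_extract(greetings, "Yo+")  # at least one "o" is required

spellings <- c("colour", "color", "colouur")
optional_u <- str_detect(spellings, "colou?r")  # at most one "u"

print(star)        # "Yo" "Yoooooo" "Y"
print(plus)        # "Yo" "Yoooooo" NA
print(optional_u)  # TRUE TRUE FALSE
```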

Grouping and capturing

Parentheses are used to group patterns together and capture specific parts of a match. This is useful for extracting specific information from a string.

# Match and capture the word after "Hello"
pattern <- "Hello (\\w+)"
str <- c("Hello World", "Hello you", "Hello +1234")
str_view(str, pattern, html = TRUE, match = NA)

# Access the groups
str_match_all(str, pattern)

[[1]]
     [,1]          [,2]   
[1,] "Hello World" "World"

[[2]]
     [,1]        [,2] 
[1,] "Hello you" "you"

[[3]]
     [,1] [,2]
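
Captured groups can also be pulled out as a plain vector or reused in a replacement. The sketch below assumes {stringr} is installed; `greet` and the replacement text are made up for illustration. In the replacement string, "\\1" refers to the first captured group.

```r
library(stringr)

greet <- c("Hello World", "Hello you", "Hello +1234")
pattern <- "Hello (\\w+)"

# str_match() returns one row per string; column 2 holds the first group
captured <- str_match(greet, pattern)[, 2]

# Reuse the captured word in a replacement; non-matching strings are unchanged
replaced <- str_replace(greet, pattern, "Goodbye \\1")

print(captured)  # "World" "you" NA
print(replaced)  # "Goodbye World" "Goodbye you" "Hello +1234"
```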

Grouping

Grouping can also be combined with quantifiers to specify that a whole pattern, rather than a single character, repeats.

# look for repetitions of the pattern "ab"

pattern <- "(ab)+"
str <- "abba ababababa aaab"
str_view(str, pattern, html = TRUE, match = NA)

# compare this to:
pattern <- "[ab]+"
str_view(str, pattern, html = TRUE, match = NA)
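
Extracting the matches makes the difference between the two patterns explicit. A minimal sketch, assuming {stringr} is installed; `s` is the same string as in the example above under an illustrative name.

```r
library(stringr)

s <- "abba ababababa aaab"

# (ab)+ : the two-character sequence "ab", repeated one or more times
grouped <- str_extract_all(s, "(ab)+")[[1]]

# [ab]+ : any run of the characters "a" and "b", in any order
class_run <- str_extract_all(s, "[ab]+")[[1]]

print(grouped)    # "ab" "abababab" "ab"
print(class_run)  # "abba" "ababababa" "aaab"
```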

Alternation

The pipe symbol “|” allows you to specify multiple alternative patterns. For example, “cat|dog” matches either “cat” or “dog”.

# Match either "cat" or "dog"
pattern <- "cat|dog"
str <- c("I have a cat", "I have a dog", "I have a dog and a cat")

str_view(str, pattern, html = TRUE, match = NA)
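
Alternation can be limited to part of a pattern by wrapping it in parentheses. The sketch below assumes {stringr} is installed; the vectors `pets` and `greys` are illustrative.

```r
library(stringr)

pets <- c("I have a cat", "I have a dog", "I have a dog and a cat")

# One list element per input string, with every match in order of appearance
found <- str_extract_all(pets, "cat|dog")

# Parentheses restrict the alternation to a single position
greys <- c("grey", "gray", "groy")
is_grey <- str_detect(greys, "gr(a|e)y")

print(found[[3]])  # "dog" "cat"
print(is_grey)     # TRUE TRUE FALSE
```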

Escape sequences

These are used to match metacharacters literally, without their special meaning. For example, the regex for a literal dot is “\.”; because the backslash itself must be escaped inside an R string, you write it as “\\.”.

# Match a literal dot
pattern <- "\\."
str <- "Hello. World"

str_view(str, pattern, html = TRUE, match = NA)
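
Forgetting the escape can silently change the result. This is a minimal sketch, assuming {stringr} is installed; `v` is an invented example. fixed() is a {stringr} helper that treats the pattern as plain text, which avoids escaping altogether.

```r
library(stringr)

v <- "a.b.c"  # illustrative input

unescaped <- str_replace_all(v, ".", "-")         # "." matches every character
escaped   <- str_replace_all(v, "\\.", "-")       # only literal dots
literal   <- str_replace_all(v, fixed("."), "-")  # pattern taken as plain text

print(unescaped)  # "-----"
print(escaped)    # "a-b-c"
print(literal)    # "a-b-c"
```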

Lookarounds

These are used to perform lookahead and lookbehind assertions. They allow you to match patterns based on what comes before or after the current position without including it in the final match.

# Match "cat" only when it is followed by "s" (the "s" is not part of the match)
pattern <- "cat(?=s)"
str <- c("cats", "cat", "caterpillars")
str_view(str, pattern, html = TRUE, match = NA)
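
Besides the positive lookahead "(?=...)", there are also a negative lookahead "(?!...)" and a lookbehind "(?<=...)". A small sketch, assuming {stringr} is installed; the price string is an invented example.

```r
library(stringr)

animals <- c("cats", "cat", "caterpillars")

followed_by_s     <- str_extract(animals, "cat(?=s)")  # positive lookahead
not_followed_by_s <- str_extract(animals, "cat(?!s)")  # negative lookahead

# Lookbehind: digits that come directly after a dollar sign
price <- str_extract("Price: $42", "(?<=\\$)\\d+")

print(followed_by_s)      # "cat" NA NA
print(not_followed_by_s)  # NA "cat" "cat"
print(price)              # "42"
```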

Greedy vs. non-greedy matching

By default, regular expressions are greedy, meaning they match as much as possible. Adding a “?” after a quantifier makes it non-greedy, matching as little as possible.

# Greedy matching: match as much as possible
pattern <- "a.*b"
str <- "an abnormally long sentence that ends with b"
str_view(str, regex(pattern), html = TRUE, match = NA)

# Non-greedy matching: match as little as possible
pattern <- "a.*?b"
str_view(str, regex(pattern), html = TRUE, match = NA)
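
The difference shows up most clearly when the matches are extracted. A minimal sketch, assuming {stringr} is installed; the HTML-like string `tags` is an invented example.

```r
library(stringr)

sentence <- "an abnormally long sentence that ends with b"

greedy <- str_extract(sentence, "a.*b")   # runs to the last "b"
lazy   <- str_extract(sentence, "a.*?b")  # stops at the first "b"

# A common use case: matching individual tags
tags <- "<b>bold</b> and <i>italic</i>"
each_tag <- str_extract_all(tags, "<.*?>")[[1]]

print(greedy)    # the whole sentence
print(lazy)      # "an ab"
print(each_tag)  # "<b>" "</b>" "<i>" "</i>"
```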

Modifiers

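Regular expressions often have modifiers that change their behavior. Common modifiers include “i” (case-insensitive matching) and “g” (global matching). R has no “g” flag; in {stringr}, the *_all() variants such as str_extract_all() return every match instead of only the first.

Most R regex functions let you ignore case via an argument: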

# Case-insensitive matching with argument
pattern <- "hello"
str <- "Hello World"
str_view(str, regex(pattern, ignore_case = TRUE), html = TRUE, match = NA)

# Case-insensitive matching regex style
pattern <- "(?i)hello"
str <- "Hello World"
str_view(str, regex(pattern), html = TRUE, match = NA)
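
The sketch below combines case-insensitive matching with the first-match and all-matches variants. It assumes {stringr} is installed; `x` is an invented example string.

```r
library(stringr)

x <- "Hello hello HELLO"  # illustrative input

# First match only
first <- str_extract(x, regex("hello", ignore_case = TRUE))

# Every match (the stringr counterpart of a "global" search)
all_hits <- str_extract_all(x, regex("hello", ignore_case = TRUE))[[1]]

# The inline modifier works without regex()
n_hits <- str_count(x, "(?i)hello")

print(first)     # "Hello"
print(all_hits)  # "Hello" "hello" "HELLO"
print(n_hits)    # 3
```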

How to use regex in your daily R work

When working with text data of course…

Use the great {stringr} package to do all sorts of things with strings…

str_remove()
str_replace()
str_extract()
str_detect()
str_split()
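
Here is one short call per function, as a minimal sketch that assumes {stringr} is installed. All input strings are invented for illustration.

```r
library(stringr)

removed   <- str_remove("report.txt", "\\.txt$")  # drop a file extension
replaced  <- str_replace("colour", "ou", "o")     # replace the first match
extracted <- str_extract("order 66", "\\d+")      # pull out the first number
detected  <- str_detect("apple pie", "pie")       # TRUE/FALSE test
pieces    <- str_split("a,b,c", ",")[[1]]         # split on a separator

print(removed)    # "report"
print(replaced)   # "color"
print(extracted)  # "66"
print(detected)   # TRUE
print(pieces)     # "a" "b" "c"
```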

…but also for selecting columns using tidyselect functions :)

library(dplyr)
# Select all columns whose name ends in "t"
mtcars %>% 
  select(matches("t$")) %>% 
  head()
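
A pattern meant for matches() can be tried out on the column names first. This is a minimal sketch, assuming {stringr} is installed; str_subset() keeps the names that match. Note that "$" only acts as an end anchor outside square brackets; inside a character class such as "[t$]" it is a literal dollar sign.

```r
library(stringr)

cols <- names(mtcars)

ends_in_t   <- str_subset(cols, "t$")  # names ending in "t"
starts_in_d <- str_subset(cols, "^d")  # names starting with "d"

print(ends_in_t)    # "drat" "wt"
print(starts_in_d)  # "disp" "drat"
```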

…and when using the pivot_*() or separate_*() function families

head(who)

names(who)

 [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
 [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
[11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
[16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
[21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
[26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
[31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
[36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
[41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
[46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
[51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
[56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

who %>% pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = c("diagnosis", "gender", "age"),
  names_pattern = "new_?(.*)_(.)(.*)",
  values_to = "count"
)
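
To see how names_pattern splits a column name, the same regex can be applied to a few names with str_match(). A minimal sketch, assuming {stringr} is installed; the three names are taken from the names(who) output above. Each pair of parentheses fills one of the names_to columns.

```r
library(stringr)

cols <- c("new_sp_m014", "new_ep_f65", "newrel_m1524")
parts <- str_match(cols, "new_?(.*)_(.)(.*)")

diagnosis <- parts[, 2]  # first group
gender    <- parts[, 3]  # second group: a single character
age       <- parts[, 4]  # third group: the rest of the name

print(diagnosis)  # "sp" "ep" "rel"
print(gender)     # "m" "f" "m"
print(age)        # "014" "65" "1524"
```

The optional underscore in "new_?" is what lets the pattern handle both the "new_sp_" and the "newrel_" naming styles.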

Exercise

Use the inbuilt sentences dataset (comes with the stringr package) and do the following:

Use str_view() to find all sentences that start with the definite article “The”. Tip: be sure that you don’t accidentally capture sentences that start with “They”, or “These”…

str_view(sentences, pattern = "")

Use str_view() to find all sentences that begin with a pronoun (He, She, It, They).

str_view(sentences, pattern = "")

Use the words dataset and str_detect() to only get those words that are a colour. Tip: the colours() function (also available as colors()) prints a nice list of colours ;)

colors()

words[str_detect(words, pattern)]

Resources

Write regex code interactively

Helpful regex guides: