Regular expressions (regex) are a language to tell a program how to look for certain patterns in strings (text data).
I almost always have to look up how to specify a regular expression.
And then I still almost always get it wrong…
Tip
Here are a few ideas to help you:
Check whether your regular expression works using this interactive online tool: https://regexr.com/
Consult this helpful cheatsheet: https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf
Ask ChatGPT to write the regex for you (regexes are largely language-agnostic, but it is still a good idea to mention R in the prompt)
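Character classes allow you to match specific sets of characters. For example, “[a-z]” matches any lowercase letter and “[A-Za-z]” matches both lowercase and uppercase letters. Avoid “[A-z]”: in ASCII order it also matches the punctuation characters that sit between “Z” and “a”, such as “[”, “^” and “_”.
You can define your own classes using the [] notation. There are also quite a few pre-defined character classes to choose from, listed below. Remember that inside an R string each backslash must be doubled, so “\d” is written “\\d”.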
R regex | What matches |
---|---|
\w | Any word character (any letter, digit, or underscore) |
\W | Any non-word character |
\d | Any digit |
\D | Any non-digit |
\s | Any space character (a space, a tab, a new line, etc.) |
\S | Any non-space character |
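As a rough illustration of the classes in the table, the sketch below uses base R's grepl() so that it runs without extra packages. The test vector `x` and the result names are made up for this example.

```r
# Hypothetical test strings (not from the original text)
x <- c("abc", "123", "a_1", "hello world", "")

# "\\d": does the string contain a digit? (backslash doubled in R strings)
has_digit <- grepl("\\d", x)
# FALSE TRUE TRUE FALSE FALSE

# "\\s": does the string contain whitespace?
has_space <- grepl("\\s", x)
# FALSE FALSE FALSE TRUE FALSE

# "^\\w+$": is the string made of word characters only?
only_word <- grepl("^\\w+$", x)
# TRUE TRUE TRUE FALSE FALSE

# A custom class: lowercase letters only
only_lower <- grepl("^[a-z]+$", x)
# TRUE FALSE FALSE FALSE FALSE
```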
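To make things more confusing, there are also other ways to specify classes. In base R functions such as grepl(), the classes below must be placed inside a bracket expression, e.g. “[[:alpha:]]”.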
R regex | What matches |
---|---|
[:alpha:] | Any letter |
[:lower:] | Any lowercase letter |
[:upper:] | Any uppercase letter |
[:digit:] | Any digit (equivalent to \d) |
[:alnum:] | Any letter or number |
[:word:] | Any letter or number, as well as underscores |
[:xdigit:] | Any hexadecimal digit |
[:punct:] | Any punctuation character |
[:graph:] | Any letter, number, or punctuation character |
[:space:] | A space, a tab, a new line, etc. (equivalent to \s) |
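A minimal sketch of these named classes in base R, where they have to sit inside an outer pair of brackets. The strings in `x` are invented for the example.

```r
# Hypothetical test strings (not from the original text)
x <- c("Hello", "42", "snake_case", "a b", "!?")

# Contains at least one uppercase letter
has_upper <- grepl("[[:upper:]]", x)
# TRUE FALSE FALSE FALSE FALSE

# Consists of punctuation only
only_punct <- grepl("^[[:punct:]]+$", x)
# FALSE FALSE FALSE FALSE TRUE

# Several named classes can share one bracket expression:
# letters and whitespace only (the underscore in "snake_case" fails)
letters_or_space <- grepl("^[[:alpha:][:space:]]+$", x)
# TRUE FALSE FALSE TRUE FALSE
```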
Metacharacters are characters with special meanings in regular expressions, including “.”, “-”, “\”, “^”, “[]”, “()”, and “$”. Note that “-” is only special inside “[]”, where it defines a range.
Anchors allow you to specify where a match should occur within a string: “^” matches at the start of the string and “$” matches at the end.
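A small sketch of anchoring with “^” and “$” in base R. The vector `x` is an invented example.

```r
# Hypothetical test strings (not from the original text)
x <- c("apple pie", "pineapple", "apple")

# "^apple": the string starts with "apple"
starts_with <- grepl("^apple", x)
# TRUE FALSE TRUE

# "apple$": the string ends with "apple"
ends_with <- grepl("apple$", x)
# FALSE TRUE TRUE

# Both anchors together: the whole string is exactly "apple"
exact <- grepl("^apple$", x)
# FALSE FALSE TRUE
```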
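Quantifiers specify the number of times a pattern should repeat. Examples include “*” (zero or more times), “+” (one or more times), “?” (zero or one time), “{n}” (exactly n times), and “{n,m}” (between n and m times).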
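To make the common quantifiers concrete, here is a sketch in base R that applies each one to the letter “a” between “c” and “t”. The input strings are made up.

```r
# Hypothetical test strings with 0, 1, 2 and 3 "a"s
x <- c("ct", "cat", "caat", "caaat")

zero_or_more <- grepl("ca*t", x)      # "*": zero or more
# TRUE TRUE TRUE TRUE
one_or_more  <- grepl("ca+t", x)      # "+": one or more
# FALSE TRUE TRUE TRUE
zero_or_one  <- grepl("ca?t", x)      # "?": zero or one
# TRUE TRUE FALSE FALSE
exactly_two  <- grepl("ca{2}t", x)    # "{2}": exactly two
# FALSE FALSE TRUE FALSE
two_to_three <- grepl("ca{2,3}t", x)  # "{2,3}": two or three
# FALSE FALSE TRUE TRUE
```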
Parentheses are used to group patterns together and capture specific parts of a match. This is useful for extracting specific information from a string.
Grouping can also be combined with quantifiers to specify that a whole group repeats, e.g. “(ab){2}” matches “abab”.
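One way to see capture groups at work in base R is through backreferences (“\\1”, “\\2”, ...) in sub(). The dates and the pattern below are assumptions made for this sketch.

```r
# Hypothetical ISO-style dates (not from the original text)
dates <- c("2024-01-15", "1999-12-31")

# Three capture groups: year, month, day
pattern <- "(\\d{4})-(\\d{2})-(\\d{2})"

# Keep only the first group (the year)
years <- sub(pattern, "\\1", dates)
# "2024" "1999"

# Reorder the groups into day/month/year
reordered <- sub(pattern, "\\3/\\2/\\1", dates)
# "15/01/2024" "31/12/1999"

# A quantifier applied to a group repeats the whole group
repeated <- grepl("^(ab){2}$", c("abab", "abb", "ab"))
# TRUE FALSE FALSE
```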
The pipe symbol “|” allows you to specify multiple alternative patterns. For example, “cat|dog” matches either “cat” or “dog”.
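A short sketch of alternation in base R. Note how the unanchored pattern also matches inside longer words; wrapping the alternatives in a group lets the anchors apply to both. The words in `x` are invented.

```r
# Hypothetical test strings (not from the original text)
x <- c("cat", "dog", "bird", "catalog", "hotdog")

# Matches "cat" or "dog" anywhere in the string
anywhere <- grepl("cat|dog", x)
# TRUE TRUE FALSE TRUE TRUE

# Group the alternatives so that "^" and "$" apply to both
whole_word <- grepl("^(cat|dog)$", x)
# TRUE TRUE FALSE FALSE FALSE
```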
Escape sequences are used to match metacharacters literally, without their special meaning. For example, to match a literal dot in R, you need to escape it as “\\.”.
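The sketch below contrasts an unescaped dot with an escaped one in base R, and shows `fixed = TRUE` as an alternative when the whole pattern should be taken literally. The file names are made up.

```r
# Hypothetical file names (not from the original text)
files <- c("data.csv", "datacsv", "data_csv")

# Unescaped "." matches any single character, so "data_csv" also matches
unescaped <- grepl("data.csv", files)
# TRUE FALSE TRUE

# Escaped "\\." matches a literal dot only
escaped <- grepl("data\\.csv", files)
# TRUE FALSE FALSE

# fixed = TRUE treats the whole pattern as plain text
literal <- grepl("data.csv", files, fixed = TRUE)
# TRUE FALSE FALSE
```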
Lookarounds perform lookahead and lookbehind assertions. They allow you to match a pattern based on what comes before or after the current position, without including that context in the final match.
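A minimal sketch of lookarounds in base R. Base R functions need `perl = TRUE` for this syntax. The `prices` vector is an invented example.

```r
# Hypothetical test strings (not from the original text)
prices <- c("cost: $25", "25 apples", "EUR 30", "$7 off")

# Lookbehind "(?<=\\$)": digits directly after a "$";
# the "$" itself is not part of the match
after_dollar <- regmatches(prices, regexpr("(?<=\\$)\\d+", prices, perl = TRUE))
# "25" "7"

# Lookahead "(?= apples)": digits directly followed by " apples"
before_apples <- regmatches(prices, regexpr("\\d+(?= apples)", prices, perl = TRUE))
# "25"

# Negative lookahead "(?!\\$)": strings that do not start with "$"
no_dollar_start <- grepl("^(?!\\$)", prices, perl = TRUE)
# TRUE TRUE TRUE FALSE
```

regmatches() combined with regexpr() drops the strings that have no match, which is why the results are shorter than the input.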
By default, regular expressions are greedy, meaning they match as much as possible. Adding a “?” after a quantifier makes it non-greedy, matching as little as possible.
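The difference is easiest to see on a string with several possible end points. In this sketch (the `html` string is made up), the greedy pattern runs to the last “>” while the non-greedy one stops at the first.

```r
# Hypothetical input (not from the original text)
html <- "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as possible, up to the last ">"
greedy <- regmatches(html, regexpr("<.+>", html, perl = TRUE))
# "<b>bold</b> and <i>italic</i>"

# Non-greedy: ".+?" stops at the first ">"
lazy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))
# "<b>"
```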
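Regular expressions often have modifiers (flags) that change their behavior. Common modifiers include “i” (case-insensitive matching) and “g” (global matching).
Most R functions let you ignore case through an argument, e.g. ignore.case = TRUE in grepl() or regex(pattern, ignore_case = TRUE) in stringr. R has no “g” flag; instead, functions come in first-match and all-matches variants, such as sub() and gsub(), or str_replace() and str_replace_all().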
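A sketch of how these two modifiers map onto base R: case-insensitivity is an argument (or an inline “(?i)” with `perl = TRUE`), and “global” behaviour comes from choosing gsub() over sub(). The example strings are invented.

```r
# Hypothetical test strings (not from the original text)
x <- c("Regex", "REGEX", "regex", "regular")

# Case-sensitive by default
case_sensitive <- grepl("regex", x)
# FALSE FALSE TRUE FALSE

# Ignore case via the function argument
via_argument <- grepl("regex", x, ignore.case = TRUE)
# TRUE TRUE TRUE FALSE

# Ignore case via an inline modifier (needs perl = TRUE)
via_inline <- grepl("(?i)regex", x, perl = TRUE)
# TRUE TRUE TRUE FALSE

# "Global" matching: sub() replaces the first match, gsub() replaces all
first_only <- sub("a", "A", "banana")
# "bAnana"
all_matches <- gsub("a", "A", "banana")
# "bAnAnA"
```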
Use the great {stringr} package to do all sorts of things with strings:
str_remove()
str_replace()
str_extract()
str_detect()
str_split()
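As a quick tour of the five functions listed above, the sketch below applies each one to the same small vector. It assumes the stringr package is installed; the vector `x` is made up for illustration.

```r
library(stringr)

# Hypothetical test strings (not from the original text)
x <- c("apple_01", "banana_22", "cherry")

# str_detect(): TRUE/FALSE for whether the pattern matches
detected <- str_detect(x, "\\d")
# TRUE TRUE FALSE

# str_remove(): delete the first match
removed <- str_remove(x, "_\\d+")
# "apple" "banana" "cherry"

# str_replace(): replace the first match
replaced <- str_replace(x, "_", "-")
# "apple-01" "banana-22" "cherry"

# str_extract(): pull out the first match (NA when there is none)
extracted <- str_extract(x, "\\d+")
# "01" "22" NA

# str_split(): split each string at the pattern; returns a list
pieces <- str_split(x, "_")
```

Each of str_remove(), str_replace() and str_extract() also has an `_all` variant that acts on every match rather than only the first.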
[1] "country" "iso2" "iso3" "year" "new_sp_m014"
[6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
[11] "new_sp_m65" "new_sp_f014" "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
[16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65" "new_sn_m014" "new_sn_m1524"
[21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"
[26] "new_sn_f014" "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
[31] "new_sn_f5564" "new_sn_f65" "new_ep_m014" "new_ep_m1524" "new_ep_m2534"
[36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65" "new_ep_f014"
[41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
[46] "new_ep_f65" "newrel_m014" "newrel_m1524" "newrel_m2534" "newrel_m3544"
[51] "newrel_m4554" "newrel_m5564" "newrel_m65" "newrel_f014" "newrel_f1524"
[56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"
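Column names like the ones printed above are a natural target for capture groups. The sketch below is only an illustration: it assumes that each name consists of “new”, an optional underscore, a short code, a sex letter and an age range, which the original text does not state. It uses str_match() from stringr on a hand-copied subset of the names.

```r
library(stringr)

# A few of the names shown above, copied by hand
cols <- c("new_sp_m014", "new_ep_f65", "newrel_m1524", "country")

# Assumed structure: new[_]<code>_<m|f><digits>
# str_match() returns a matrix: full match first, then one column per group
parts <- str_match(cols, "^new_?(sp|sn|ep|rel)_([mf])(\\d+)$")

code_part <- parts[, 2]
# "sp" "ep" "rel" NA
sex_part <- parts[, 3]
# "m" "f" "m" NA
age_part <- parts[, 4]
# "014" "65" "1524" NA
```

Names that do not fit the assumed pattern, such as "country", come back as NA in every column.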
Use the inbuilt sentences dataset (comes with the stringr package) and do the following:
Use str_view() to find all sentences that start with the definite article “The”. Tip: be sure that you don’t accidentally capture sentences that start with “They” or “These”.
Use str_view() to find all sentences that begin with a pronoun (He, She, It, They).
Use the words dataset and str_detect() to only get those words that are a colour. Tip: the colours() function prints a nice list of colours ;)
Try regex interactively: https://regexr.com/
Write regex code interactively
Helpful regex guides: