Advanced Tidyverse Tips & Tricks

Tidy data principles

What is tidy data?

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

Non-Tidy data example

library(tidyverse)

chocolate <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')

head(chocolate) %>% DT::datatable(class = "pagedtable-not-empty")

This data is not tidy because some cells hold more than one value (e.g. the ingredients column packs several ingredient codes into a single string), which breaks the third principle above.
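One way to repair a column like this is to split the packed values so that each one gets its own row. A minimal sketch on a made-up tibble (the bars data and its values are invented for illustration and only mimic the shape of the chocolate data), assuming tidyr 1.3.0 or later for separate_longer_delim():

```r
library(dplyr)
library(tibble)
library(tidyr)

# Hypothetical toy data: several ingredient codes packed into one cell
bars <- tibble(
  bar         = c("A", "B"),
  ingredients = c("B,S,C", "B,S")
)

# Split on the comma so that every ingredient gets its own row
bars_long <- bars %>%
  separate_longer_delim(ingredients, delim = ",")

bars_long
```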

Non-Tidy data example (cont.)

classroom <- tribble(
  ~assessment, ~Billy, ~Suzy, ~Lionel, ~Jenny,
  "quiz1",     NA,     "F",   "B",     "A",
  "quiz2",     "D",    NA,    "C",     "A",
  "test1",     "C",    NA,    "B",     "B"
  )

classroom

Tidy data example (cont.)

Making the classroom data tidy

classroom2 <- classroom %>% 
  pivot_longer(Billy:Jenny, names_to = "student_name", values_to = "grade") %>% 
  arrange(student_name, assessment)
classroom2
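pivot_longer() keeps the missing grades as rows with NA. If those rows are not wanted, the values_drop_na argument removes them during the pivot. A sketch on the same classroom data (the name classroom_complete is made up here):

```r
library(dplyr)
library(tibble)
library(tidyr)

classroom <- tribble(
  ~assessment, ~Billy, ~Suzy, ~Lionel, ~Jenny,
  "quiz1",     NA,     "F",   "B",     "A",
  "quiz2",     "D",    NA,    "C",     "A",
  "test1",     "C",    NA,    "B",     "B"
)

# Same pivot as before, but rows without a grade are dropped
classroom_complete <- classroom %>%
  pivot_longer(Billy:Jenny,
               names_to = "student_name",
               values_to = "grade",
               values_drop_na = TRUE) %>%
  arrange(student_name, assessment)

classroom_complete
```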

Into the tidyverse

Tidyselection

Tidy selection provides helper functions for picking columns by name pattern or position. The helpers work in many verbs beyond select(), for example:

  • contains()
  • starts_with()
  • ends_with()
  • everything()
  • last_col()

Tidyselection example

chocolate %>% 
  select(starts_with("comp"))

Tidyselection

# also works with generic regular expressions
# this is the same as starts_with("comp")
chocolate %>% 
  select(matches("^comp"))
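The other helpers work the same way. A short sketch on the built-in iris data (the result names are made up for illustration):

```r
library(dplyr)

# Columns whose names end in "Width"
width_cols <- iris %>% select(ends_with("Width")) %>% names()
width_cols

# Columns whose names contain "Petal"
petal_cols <- iris %>% select(contains("Petal")) %>% names()
petal_cols

# Only the last column
last_name <- iris %>% select(last_col()) %>% names()
last_name
```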

Column selection based on character vectors

  • all_of()
  • any_of()

my_column_vector <- c("column_name1", "column_name2")

# `data` stands in for any data frame that contains these columns
data %>%
  select(all_of(my_column_vector))
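The two helpers differ in how they treat names that are not in the data: all_of() is strict and throws an error, while any_of() silently skips them. A minimal sketch on mtcars, where "not_a_column" is a deliberately missing name:

```r
library(dplyr)

# "not_a_column" is deliberately absent from mtcars
wanted <- c("mpg", "cyl", "not_a_column")

# any_of() silently skips names that are not present
lenient <- mtcars %>% select(any_of(wanted)) %>% names()
lenient

# all_of() throws an error for the missing name; try() captures it
strict <- try(mtcars %>% select(all_of(wanted)), silent = TRUE)
inherits(strict, "try-error")
```

any_of() is handy for dropping columns that may or may not exist, all_of() for catching typos early.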

Renaming columns while selecting

A neat trick is to rename a column while selecting it, using new_name = old_name.

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
mtcars %>% 
  select(miles_per_gallon = mpg) 
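Keep in mind that select() returns only the columns you list. If the remaining columns should stay, rename() takes the same new_name = old_name syntax. A quick comparison (the result names are made up):

```r
library(dplyr)

# select() with a new name keeps only the selected column
only_mpg <- mtcars %>% select(miles_per_gallon = mpg)
ncol(only_mpg)

# rename() changes the name and keeps all other columns
all_cols <- mtcars %>% rename(miles_per_gallon = mpg)
ncol(all_cols)
```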

Reordering columns while selecting

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
mtcars %>% 
  select(cyl, disp, mpg, everything()) %>% 
  names()
 [1] "cyl"  "disp" "mpg"  "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

everything() selects all columns. Columns that were already named keep their position, so here it simply appends the rest.

A quick aside: Reordering

Although you can reorder columns using select(), if you only want to reorder things, the relocate() function works better.

df <- tibble(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df %>% relocate(f)
df %>% relocate(a, .after = c)
df %>% relocate(f, .before = b)
df %>% relocate(a, .after = last_col())

across()

across()

across() lets you apply the same operation to multiple columns at once inside verbs such as mutate() and summarise(). You can combine it with tidyselection helpers :)

iris %>%
  mutate(across(starts_with("Sepal"),
                round)) %>%
  head()

across()

You can also select columns by their type or content by combining across() with where().

iris %>% 
  mutate(across(.cols = where(is.numeric),
                round)) %>% 
  head()

across()

You can also specify multiple functions to do multiple things at once.

iris %>%
  group_by(Species) %>% 
  summarise(across(starts_with("Sepal"),
                   list(mean = mean,
                        sd = sd),
                   .names = "{col}_{fn}"))

What about additional function arguments?

Wrap the call in an anonymous function, using either the purrr-style lambda (~ with .x) or base R's \(x) shorthand (available from R 4.1).

iris %>%
  mutate(across(.cols = where(is.numeric),
                # purrr-style lambda
                .fns = ~round(.x, digits = 3)))

iris %>%
  mutate(across(.cols = where(is.numeric),
                # base R anonymous function shorthand
                .fns = \(x) round(x, digits = 3)))

We’ll learn more about anonymous functions on Thursday :)

rowwise()

rowwise() groups a data frame by row, so that the following operations are computed one row at a time. This is convenient when no vectorised function exists for what you want to do.

df <- tibble(x = runif(6), y = runif(6), z = runif(6))

df %>% 
  rowwise() %>% 
  mutate(m = mean(c(x,y,z)))

rowwise()

Inside a rowwise operation you can use tidyselection helpers via the c_across() function.

df %>% 
  rowwise() %>%
  mutate(m = mean(c_across(x:z)))
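A rowwise data frame stays grouped by row until you call ungroup(). The sketch below adds that step and, as a sanity check, compares the result with base R's rowMeans(), which happens to be a vectorised alternative for this particular calculation. The seed and the name row_means are additions for illustration.

```r
library(dplyr)
library(tibble)

set.seed(1)  # assumption: seed added only to make the example reproducible
df <- tibble(x = runif(6), y = runif(6), z = runif(6))

row_means <- df %>%
  rowwise() %>%
  mutate(m = mean(c_across(x:z))) %>%
  ungroup()  # drop the row-wise grouping again

# Compare with the vectorised base R function
all.equal(row_means$m, unname(rowMeans(df)))
```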

Exploring the wider tidyverse

Working with factors using the {forcats} package

The fct_ family of functions helps with handling factor variables, e.g.

  • fct_reorder()
  • fct_recode()
  • fct_relevel()
  • fct_collapse()
  • fct_lump()
  • … (not always the most helpful naming conventions, but good documentation)
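A quick tour of some of these functions on a made-up factor (the diet vector and the result names are invented for illustration):

```r
library(forcats)

# Hypothetical toy factor
diet <- factor(c("herbi", "carni", "herbi", "omni", "insecti", "herbi", "carni"))

# Rename levels: new name = "old name"
recoded <- fct_recode(diet, herbivore = "herbi", carnivore = "carni")
levels(recoded)

# Move a level to the front
releveled <- fct_relevel(diet, "omni")
levels(releveled)

# Merge several levels into one
collapsed <- fct_collapse(diet, meat_eater = c("carni", "insecti"))
levels(collapsed)

# Keep the 2 most frequent levels and lump the rest into "Other"
lumped <- fct_lump(diet, n = 2)
levels(lumped)
```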

Ordering bars in ggplot using fct_infreq()

msleep %>% 
  ggplot(aes(y = vore)) +
  geom_bar(stat = "count") +
  theme_light()

Ordering bars in ggplot using fct_infreq()

msleep %>% 
  mutate(vore = fct_infreq(vore)) %>% 
  ggplot(aes(y = vore)) +
  geom_bar(stat = "count") +
  theme_light()

Ordering bars in ggplot using fct_infreq()

msleep %>% 
  mutate(vore = fct_rev(fct_infreq(vore))) %>% 
  ggplot(aes(y = vore)) +
  geom_bar(stat = "count") +
  theme_light()

case_when()

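For when a single ifelse() is not enough. Conditions are checked in order and the first match wins, so the most specific condition goes first.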

data <- data.frame(x = 1:70)

data %>%
  mutate(fizzy = case_when(
    x %% 35 == 0 ~ "fizz buzz",
    x %% 7 == 0 ~ "buzz",
    x %% 5 == 0 ~ "fizz",
    .default = as.character(x)
  ))
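A small sketch of why the order of the conditions matters. The column names right and wrong are made up, and .default assumes dplyr 1.1.0 or later:

```r
library(dplyr)

data <- data.frame(x = c(5, 7, 35))

result <- data %>%
  mutate(
    # most specific condition first
    right = case_when(
      x %% 35 == 0 ~ "fizz buzz",
      x %% 7 == 0 ~ "buzz",
      x %% 5 == 0 ~ "fizz",
      .default = as.character(x)
    ),
    # most specific condition last: 35 already matches x %% 5 == 0,
    # so the "fizz buzz" branch is never reached
    wrong = case_when(
      x %% 5 == 0 ~ "fizz",
      x %% 7 == 0 ~ "buzz",
      x %% 35 == 0 ~ "fizz buzz",
      .default = as.character(x)
    )
  )

result
```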

Dates, time, and the {lubridate} package

The {lubridate} package provides convenient functions for parsing, manipulating and formatting dates and times.

If you are working with data that has date/time information, you should work with lubridate.

See the introduction to lubridate to get started.
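A few of the basics as a minimal sketch (the example date and the object names are arbitrary):

```r
library(lubridate)

# Parse dates written in different orders
d1 <- ymd("2022-01-18")
d2 <- dmy("18/01/2022")
d1 == d2

# Extract components
year(d1)
month(d1, label = TRUE)
wday(d1, label = TRUE)

# Date arithmetic and rounding
later <- d1 + days(30)
later
month_start <- floor_date(d1, unit = "month")
month_start
```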

Strings using the {stringr} package

We’ll come to that in the regex section later today.

Tidying your models with broom::tidy()

The {broom} package has great functions for creating tidy dataframes for almost all models out there!

See the introduction to broom to get started.

broom::tidy()

lmfit <- lm(mpg ~ wt, mtcars)
lmfit

Call:
lm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.285       -5.344  
summary(lmfit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
broom::tidy(lmfit)
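
tidy() returns the coefficient table as a tibble with one row per model term, so it can be piped straight into the usual dplyr verbs. A minimal sketch, assuming broom's default column names for lm objects (term, estimate, std.error, statistic, p.value); conf.int = TRUE adds conf.low and conf.high. The names coefs and slope are made up.

```r
library(dplyr)
library(broom)

lmfit <- lm(mpg ~ wt, mtcars)

# Coefficient table with confidence intervals
coefs <- tidy(lmfit, conf.int = TRUE)
names(coefs)

# Pull out the slope for wt
slope <- coefs %>%
  filter(term == "wt") %>%
  select(term, estimate, conf.low, conf.high)

slope
```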