New intuitive ways for reshaping data in R: long live pivot_longer() and pivot_wider()
TLDR: there are two new and very intuitive R functions for reshaping data: see Examples of pivot_longer()
and pivot_wider()
below. At the time of writing, these new functions are extremely fresh and only exist in the development version on GitHub (see Installation), we should probably wait for the tidyverse team to officially release them (in CRAN) before putting them into day-to-day use.
Exciting!
Introduction
The juxtapose of data collection vs data analysis: data that was very easy to collect, is probably very hard to analyse, and vice versa. For example, if data is collected/written down whichever format was most convenient at the time of data collection, it is probably not recorded in a regularly shaped table, with various bits of information in different parts of the document. And even if data is collected into a table, it is often intuitive (for data entry) to include information about the same variable in different columns. For example, look at this example data I just made up:
library(tidyverse)
candydata_raw = read_csv("2019-04-07_candy_preference_data.csv")
candy_type | likes age: 5 | likes age: 10 | likes age: 15 | gets age: 5 | gets age: 10 | gets age: 15 |
---|---|---|---|---|---|---|
Chocolate | 4 | 6 | 8 | 2 | 4 | 6 |
Lollipop | 10 | 8 | 6 | 8 | 6 | 4 |
For each candy type, there are 8 columns with values. But actually, these 8 columns capture a combination of 3 variables: age
, likes
and eats
. This is known as the wide format, and it is a convenient way to either note down or even present values. It is human-readable. For effective data analysis, however, we need data to be in the tidy data format, where each column is a single variable, and each row a single observation (https://www.jstatsoft.org/article/view/v059i10). It needs to be less human-readable and more computer-friendly.
Some of you may remember now retired reshape2::melt()
or reshape2::dcast()
, and many of you (inclduing myself!) have struggled remebering the arguments for tidyr::gather()
and tidyr::spread()
. Based on extensive community feedback, the tidyverse team have reinveted these functions using both more intuitive names, as well as clearer syntax (arguments):
Thanks to all 2649 (!!!) people who completed my survey about table shapes! I've done analysed the data at https://t.co/hyu1o91xRm and the new functions will be called pivot_longer() and pivot_wider() #rstats
— Hadley Wickham (@hadleywickham) March 24, 2019
Installation
These functions were added just a month ago, so these functions are not yet included in the standard version of tidyr
that comes with install.packages("tidyverse")
or even update.packages()
(the current version of tidyr
on CRAN is 0.8.3). To play with the bleeding edge versions of R packages, run install.packages("devtools")
and then devtools::install_github("tidyverse/tidyr")
. If you are a Mac user and it asks you “Do you want to install from sources the package which needs compilation?”, say Yes.
You might need to Restart R (Session menu at the top) and load library(tidyverse)
again. You can check whether you now have these functions installed by typing in pivot_longer
and pressing F1 - if a relevant Help tab pops open you got it.
Examples
candydata_longer = candydata_raw %>%
pivot_longer(contains("age"))
candy_type | name | value |
---|---|---|
Chocolate | likes age: 5 | 4 |
Chocolate | likes age: 10 | 6 |
Chocolate | likes age: 15 | 8 |
Chocolate | gets age: 5 | 2 |
Chocolate | gets age: 10 | 4 |
Chocolate | gets age: 15 | 6 |
Lollipop | likes age: 5 | 10 |
Lollipop | likes age: 10 | 8 |
Lollipop | likes age: 15 | 6 |
Lollipop | gets age: 5 | 8 |
Lollipop | gets age: 10 | 6 |
Lollipop | gets age: 15 | 4 |
Now, that’s already a lot better, but we still need to split the name
column into the two different variables it really includes. “name” is what pivot_longer()
calls this new column by default. Remember, each column is a single variable.
candydata_longer = candydata_raw %>%
pivot_longer(contains("age")) %>%
separate(name, into = c("questions", NA, "age"), convert = TRUE)
candy_type | questions | age | value |
---|---|---|---|
Chocolate | likes | 5 | 4 |
Chocolate | likes | 10 | 6 |
Chocolate | likes | 15 | 8 |
Chocolate | gets | 5 | 2 |
Chocolate | gets | 10 | 4 |
Chocolate | gets | 15 | 6 |
Lollipop | likes | 5 | 10 |
Lollipop | likes | 10 | 8 |
Lollipop | likes | 15 | 6 |
Lollipop | gets | 5 | 8 |
Lollipop | gets | 10 | 6 |
Lollipop | gets | 15 | 4 |
And pivot_wider()
can be used to do the reverse:
candydata = candydata_longer %>%
pivot_wider(names_from = questions, values_from = value)
candy_type | age | likes | gets |
---|---|---|---|
Chocolate | 5 | 4 | 2 |
Chocolate | 10 | 6 | 4 |
Chocolate | 15 | 8 | 6 |
Lollipop | 5 | 10 | 8 |
Lollipop | 10 | 8 | 6 |
Lollipop | 15 | 6 | 4 |
It is important to spell out the arguments here (names_from =
, values_frame =
) since they are not the second and third arguments of pivot_wider()
(like they were in spread()
). Investigate the pivot_wider+F1
Help tab for more information.
Wrap-up and notes
Now these are datasets we can work with: each column is a variable, each row is an observation.
Do not start replacing working and tested instances of gather()
or spread()
in your existing R code with these new functions. That is neither efficient nor necessary - gather()
and spread()
will remain in tidyr
to make sure people’s scripts don’t suddenly stop working. Meaning: tidyr
is backward compatible. But after these functions are officially released, I will start using them in all new scripts I write.
I made the original messy columns still relatively nice to work with - no typos and reasonable delimiters. Usually, the labels are much worse and need the help of janitor::clean_names()
, stringr::str_replace()
, and multiple iterations of tidyr::separate()
to arrive at a nice tidy tibble/data frame.
tidyr::separate() tips:
into = c("var1", NA, "var2")
- now this is an amazing trick I only came across this week! This is a convenient way to drop useless (new) columns. Previously, I would have achieved the same result with:
... %>%
separate(..., into = c("var1", "drop", "var2")) %>%
select(-drop) %>%
...
convert = TRUE
: by default, separate()
creates new variables that are also just “characters”. This means our age would have been a chacter vector of, e.g., “5”, “10”, rather than 5, 10, and R wouldn’t have known how to do arithmetic on it. In this example, convert = TRUE
is equivalent to mutate(age = as.numeric(age))
.
Good luck!
P.S. This is one of the coolest Tweets I’ve ever seen:
On this auspicious day, in honor of tidyr, I am pleased to re-introduce you to pivot_longer and pivot_wider. #rstats pic.twitter.com/hj3zqYYbVf
— Ian Lyttle (@ijlyttle) April 1, 2019