Reshaping multiple variables into tidy data (wide to long)
There’s some explanation on what reshaping data in R means, why we do it, as well as the history, e.g., melt()
vs gather()
vs pivot_longer()
in a previous post: New intuitive ways for reshaping data in R
That post shows how to reshape a single variable that had been recorded/entered across multiple different columns. But if multiple different variables are recorded over multiple different columns, then this is what you might want to do:
Example data
# from dput():
widedata = structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
time_1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
time_2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
time_3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3),
time_4 = c(4, 4, 4, 4, 4, 4, 4, 4, 4),
outcome_1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
outcome_2 = c(2, 2, 2, 2, 2, 2, 2, 2, 2),
outcome_3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3),
outcome_4 = c(4, 4, 4, 4, 4, 4, 4, 4, 4)),
row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
This is what it looks like:
And this is what we want it to look like:
Simple solution
library(tidyverse)
# pivot_longer puts everything that is not id into two columns (name and value)
# separate the variable names and stage numbers
# puts both variables back into two columns using pivot_wider()
longdata = widedata %>%
pivot_longer(-id) %>%
separate(name, into = c("name", "stage"), sep = "_") %>%
pivot_wider(names_from = "name", values_from = "value")
I think this is a very good solution as it’s easy to understand and debug. See below for a step by step explanation of these lines.
It is, however, possible to achieve the same result just by using clever arguments inside the pivot_longer()
(thank you Lisa for teaching me this):
Clever solution
# same result as above but just pivot_longer()
widedata %>%
pivot_longer(-id,
names_to = c(".value", "number"),
names_pattern = "(.+)_(.+)")
## # A tibble: 36 × 4
## id number time outcome
## <dbl> <chr> <dbl> <dbl>
## 1 1 1 1 1
## 2 1 2 2 2
## 3 1 3 3 3
## 4 1 4 4 4
## 5 2 1 1 1
## 6 2 2 2 2
## 7 2 3 3 3
## 8 2 4 4 4
## 9 3 1 1 1
## 10 3 2 2 2
## # … with 26 more rows
Simple solution step by step
Combine all into two columns first
pivot_longer(-id)
combines all columns except id
into two columns (name and value):
widedata %>%
pivot_longer(-id)
## # A tibble: 72 × 3
## id name value
## <dbl> <chr> <dbl>
## 1 1 time_1 1
## 2 1 time_2 2
## 3 1 time_3 3
## 4 1 time_4 4
## 5 1 outcome_1 1
## 6 1 outcome_2 2
## 7 1 outcome_3 3
## 8 1 outcome_4 4
## 9 2 time_1 1
## 10 2 time_2 2
## # … with 62 more rows
The defaults name
and value
may be changed using these arguments:
widedata %>%
pivot_longer(-id, names_to = "my_variables", values_to = "my_values") %>% slice(1)
## # A tibble: 1 × 3
## id my_variables my_values
## <dbl> <chr> <dbl>
## 1 1 time_1 1
- using
%>% slice(1)
for brevity
If you have multiple columns that you don’t want collected, then it’s easier to select the ones you want. In this example, it would look like this:
widedata %>%
pivot_longer(matches("time|outcome")) %>% slice(1)
## # A tibble: 1 × 3
## id name value
## <dbl> <chr> <dbl>
## 1 1 time_1 1
In this example dataset, matches("time|outcome")
has the same effect as -id
.
Search for “tidyverse select helpers” to see the various options available for selecting the variables you need (select helpers reference).
Separate column names and numbers
widedata %>%
pivot_longer(-id) %>%
separate(name, into = c("name", "stage"), sep = "_") %>% slice(1)
## # A tibble: 1 × 4
## id name stage value
## <dbl> <chr> <chr> <dbl>
## 1 1 time 1 1
pivot_wider() so that each variable has its own column
And the final step of this solution is pivot_wider()
which takes the multiple variables that pivot_longer()
combined into name
and puts them into their own columns:
widedata %>%
pivot_longer(-id) %>%
separate(name, into = c("name", "stage"), sep = "_") %>%
pivot_wider(names_from = "name", values_from = "value")
## # A tibble: 36 × 4
## id stage time outcome
## <dbl> <chr> <dbl> <dbl>
## 1 1 1 1 1
## 2 1 2 2 2
## 3 1 3 3 3
## 4 1 4 4 4
## 5 2 1 1 1
## 6 2 2 2 2
## 7 2 3 3 3
## 8 2 4 4 4
## 9 3 1 1 1
## 10 3 2 2 2
## # … with 26 more rows
What if my columns don’t have delimiters (e.g., instead of time_1, time_2,...
it’s time1, time2, ...
)
In that case I would use mutate()
+ str_extract()
/str_remove()
:
widedata %>%
pivot_longer(-id) %>%
mutate(stage = str_extract(name, "[:digit:]")) %>%
mutate(variable = str_remove(name, "_[:digit:]")) %>%
pivot_wider(names_from = "name", values_from = "value")
The first mutate extracts the number ("[:digit:]"
) from the column called name
(which is the result of pivot_longer()
), I’ve called the new variable that gets this number stage
but you can call it anything. We then remove the number from name
as it now lives in a column of its own.
This extraction is actually what the tidyr::extract()
function is for, but it always takes me much longer to get extract()
to work as compared to the easy to manage mutate()
+ str_extract()
/str_remove()
combo above.
Final words
Reshaping data is really tricky, and your spreadsheet from hell is likely much more complicated than the simple example here. It always takes me lots of trial and error to get these things to work properly. Especially if there are irregularities in the data. You may find janitor::clean_names()
useful, or you may need to do more cleanup using various str_()
functions from the stringr package.
Good luck!