Is there an smart way to gather() R wide data to long data where there are hundreds of different variables to gather?

Question

I have a dataset with hundreds of variables that is organized in wide format, like that:

df <- tibble(
             subject = 1:800,
             var1.W1 = sample(1:4, replace=T, 800), # var 1 measured at wave 1
             var1.W2 = sample(1:4, replace=T, 800), # var 1 measured at wave 2
             var1.W3 = sample(1:4, replace=T, 800), # var 1 measured at wave 3
             var1.W4 = sample(1:4, replace=T, 800), # var 1 measured at wave 4
...
             var_n.W1 = sample(1:4, replace=T, 800), # var n measured at wave 1
             var_n.W2 = sample(1:4, replace=T, 800), # var n measured at wave 2
             var_n.W3 = sample(1:4, replace=T, 800), # var n measured at wave 3
             var_n.W4 = sample(1:4, replace=T, 800)  # var n measured at wave 4
             )

In order to transform that wide data to long data, I am doing that:

df2 <- df %>%
  gather(var1, var1_value, c(var1.W1, var1.W2, var1.W3, var1.W4))
df2 <- df2 %>%
  gather(var2, var2_value, c(var2.W1, var2.W2, var2.W3, var2.W4))
# Etc...

I suspect that isn't the smartest thing to do. Is there same good alternative to covert this kind of wide data to long data? (If it helps: all variables I want to gather has the same prefix and the sufix (.W1, .W2, .W3 or .W4)) point what wave it come from.

score 3 · Accepted Answer · answered Jul 09 '20 at 00:49

3

You can do this with pivot_longer :

tidyr::pivot_longer(df, cols = -subject, 
                    names_to = c('.value', 'value'), 
                    names_sep = '\\.')


#   subject value  var1 var_n
#     <int> <chr> <int> <int>
# 1       1 W1        4     3
# 2       1 W2        2     1
# 3       1 W3        4     2
# 4       1 W4        3     1
# 5       2 W1        2     4
# 6       2 W2        4     2
# 7       2 W3        1     1
# 8       2 W4        2     3
# 9       3 W1        4     4
#10       3 W2        2     4
#....

answered Jul 09 '20 at 00:49

Ronak Shah

377,200
20
156
213

Thank you so much. It works! But it emerged another challenge: Error: Can't combine `sex.W1` and `sex.W4` . There are a lot of variables that was coded with different types across waves... Do you know same trick to coerce variables to the most complex type? – Tormod Jul 09 '20 at 06:28
If the types are different you might want to change everything to character first. Something like `df %>% mutate(across(-subject, as.character)) %>% tidyr::pivot_longer(.....` Or with `mutate_at`, `df %>% mutate_at(-1, as.character).....` – Ronak Shah Jul 09 '20 at 06:33

score 1 · Answer 2 · answered Jul 09 '20 at 00:38

Does the following help you a little?
(Please share an example of your desired output if this is missing the point)

library(dplyr)

df2 <- df %>%
  pivot_longer(
    starts_with('var'),
    names_to = 'var',
    values_to = 'value'
  ) %>%
  mutate(
    var_n = str_extract(var, '^var[0-9n_]+(?=\\.)'),
    W_n = str_extract(var, '(?<=\\.)W\\d+'),
  ) %>%
  select(-var) %>%
  pivot_wider(
    names_from = var_n,
    values_from = value
  )

I'm essentially gathering the whole df into three columns (subject, var & value) and then spread it out var-wise.
Since you say you have W1 through W4 for all vars, I guess it makes sense to have four rows per Subject and the respective var value in a column that follows.

The output of that becomes

> print(df2)
 # A tibble: 3,200 x 5
   subject W_n    var1  var2 var_n
     <int> <chr> <int> <int> <int>
 1       1 W1        1     4     4
 2       1 W2        4     1     3
 3       1 W3        3     3     4
 4       1 W4        1     3     3
 5       2 W1        2     1     4
 6       2 W2        2     2     1
 7       2 W3        2     2     4
 8       2 W4        3     2     3
 9       3 W1        2     4     1
10       3 W2        1     2     2

Is there an smart way to gather() R wide data to long data where there are hundreds of different variables to gather?

2 Answers2