Using pivot_longer in tidyr with a complex separator

Question

In a previous post here I tried to get the equivalent of an rbind using tidyr::pivotlonger(). This is the data and the solution.

set.seed(1)
df1 <- data.frame(group = rep(letters[1:2],each=3),
                 day = rep(1:3,2),
                 var1_mean = round(rnorm(6),2),
                 var1_sd = round(rnorm(6,5),2),
                 var2_mean = round(rnorm(6),2),
                 var2_sd = round(rnorm(6,5),2))


#   group day var1_mean var1_sd var2_mean var2_sd
# 1     a   1     -0.63    5.49     -0.62    5.82
# 2     a   2      0.18    5.74     -2.21    5.59
# 3     a   3     -0.84    5.58      1.12    5.92
# 4     b   1      1.60    4.69     -0.04    5.78
# 5     b   2      0.33    6.51     -0.02    5.07
# 6     b   3     -0.82    5.39      0.94    3.01


df1 %>% 
  pivot_longer(cols = starts_with('var'), 
               names_to = c('grp', '.value'),
               names_sep="_")

#     group   day grp    mean    sd
#     <fct> <int> <chr> <dbl> <dbl>
# 1   a         1 var1  -0.63  5.49
# 2   a         1 var2  -0.62  5.82
# 3   a         2 var1   0.18  5.74
# 4   a         2 var2  -2.21  5.59
# 5   a         3 var1  -0.84  5.58
# 6   a         3 var2   1.12  5.92
# 7   b         1 var1   1.6   4.69
# 8   b         1 var2  -0.04  5.78
# 9   b         2 var1   0.33  6.51
# 10  b         2 var2  -0.02  5.07
# 11  b         3 var1  -0.82  5.39
# 12  b         3 var2   0.94  3.01

This solution is quite contingent on the naming convention used for the mean and sd variables. If there is a different naming convention, with a more complex separator between the two important nodes of the column names, like so...

df2 <- data.frame(group = rep(letters[1:2],each=3),
                  day = rep(1:3,2),
                  mean_var_1 = round(rnorm(6),2),
                  sd_var_1 = round(rnorm(6,5),2),
                  mean_var_2 = round(rnorm(6),2),
                  sd_var_2 = round(rnorm(6,5),2))

df2

#   group day mean_var_1 sd_var_1 mean_var_2 sd_var_2
# 1     a   1       0.62     6.36      -0.39     5.70
# 2     a   2      -0.06     4.90      -0.06     5.56
# 3     a   3      -0.16     5.39       1.10     4.31
# 4     b   1      -1.47     4.95       0.76     4.29
# 5     b   2      -0.48     3.62      -0.16     5.36
# 6     b   3       0.42     4.59      -0.25     5.77

How would I achieve a similar result to the first example, with a single mean and sd column and with var_1 and var_2 as the grouping variable?

score 3 · Accepted Answer · answered Oct 01 '20 at 06:27

If you have names that are complicated you can use names_pattern argument where you can specify how each part of column name would be used to get data in long format.

tidyr::pivot_longer(df2, 
                    cols = contains('var'), 
                    names_to = c('.value', 'grp'), 
                    names_pattern = '(.*?)_(.*)')

#   group   day grp    mean    sd
#   <chr> <int> <chr> <dbl> <dbl>
# 1 a         1 var_1  0.62  6.36
# 2 a         1 var_2 -0.39  5.7 
# 3 a         2 var_1 -0.06  4.9 
# 4 a         2 var_2 -0.06  5.56
# 5 a         3 var_1 -0.16  5.39
# 6 a         3 var_2  1.1   4.31
# 7 b         1 var_1 -1.47  4.95
# 8 b         1 var_2  0.76  4.29
# 9 b         2 var_1 -0.48  3.62
#10 b         2 var_2 -0.16  5.36
#11 b         3 var_1  0.42  4.59
#12 b         3 var_2 -0.25  5.77

'(.*?)_(.*)' uses two groups of data where the first group is everything until the first underscore ((.*?)) in the column name and the second group is everything after the underscore following the first group ((.*)).

Using pivot_longer in tidyr with a complex separator

1 Answers1