
I have a data.frame with the mean and standard error for two variables, var1 and var2.

This data.frame, original_df, came from creating those statistics from data for each of two groups:

original_df <- data.frame(group_dummy_code = c(0, 1),
           var1_mean = c(1.5, 2.5),
           var1_se = c(.025, .05),
           var2_mean = c(3.5, 4.5),
           var2_se = c(.075, .1))

> original_df
  group_dummy_code var1_mean var1_se var2_mean var2_se
1                0       1.5   0.025       3.5   0.075
2                1       2.5   0.050       4.5   0.100

I'm trying to use the tidyr function gather() to change the data.frame into desired_df in order to plot the two variables' means and standard errors:

desired_df <- data.frame(group_dummy_code = c(0, 1, 0, 1),
                         key = c("var1", "var1", "var2", "var2"),
                         val_mean = c(1.5, 2.5, 3.5, 4.5),
                         val_se = c(.025, .05, .075, .1))

> desired_df
  group_dummy_code  key val_mean val_se
1                0 var1      1.5  0.025
2                1 var1      2.5  0.050
3                0 var2      3.5  0.075
4                1 var2      4.5  0.100

I tried to gather() twice with the following:

original_df %>%
    gather(mean_key, mean_val, -group_dummy_code, -contains("se")) %>% 
    gather(se_key, se_val, -group_dummy_code, -mean_key, -mean_val)

But, this results in too many rows (in particular, with multiple standard errors for each mean):

  group_dummy_code  mean_key mean_val  se_key se_val
1                0 var1_mean      1.5 var1_se  0.025
2                1 var1_mean      2.5 var1_se  0.050
3                0 var2_mean      3.5 var1_se  0.025
4                1 var2_mean      4.5 var1_se  0.050
5                0 var1_mean      1.5 var2_se  0.075
6                1 var1_mean      2.5 var2_se  0.100
7                0 var2_mean      3.5 var2_se  0.075
8                1 var2_mean      4.5 var2_se  0.100

This seems like a fairly common processing step, especially after computing summary statistics such as the mean and standard deviation for a number of variables, but calling gather() twice (once for the means and once for the standard errors) doesn't seem like a good approach.
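
For reference, the double gather() output above has eight rows because, within each group, every mean is paired with every standard error. A minimal sketch of one way to salvage that attempt, assuming every column name follows the `<variable>_<statistic>` pattern used here: keep only the rows where the variable prefix of mean_key matches the prefix of se_key. The name matched_df is only illustrative.

```r
library(tidyr)
library(dplyr)

original_df <- data.frame(group_dummy_code = c(0, 1),
                          var1_mean = c(1.5, 2.5),
                          var1_se = c(.025, .05),
                          var2_mean = c(3.5, 4.5),
                          var2_se = c(.075, .1))

matched_df <- original_df %>%
    # First gather: stack the *_mean columns, leaving the *_se columns wide
    gather(mean_key, mean_val, -group_dummy_code, -contains("se")) %>%
    # Second gather: stack the *_se columns (this creates the 8-row cross product)
    gather(se_key, se_val, -group_dummy_code, -mean_key, -mean_val) %>%
    # Keep only rows whose mean and standard error belong to the same variable
    filter(sub("_mean$", "", mean_key) == sub("_se$", "", se_key)) %>%
    # Rename to the column layout of desired_df
    transmute(group_dummy_code,
              key = sub("_mean$", "", mean_key),
              val_mean = mean_val,
              val_se = se_val)
```

This works, but it builds the full cross product before discarding half of it, so it is a workaround rather than a clean reshape.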

Using tidyr (or dplyr or another package), how can I create desired_df from original_df?

Joshua Rosenberg
  • See also [Reshaping multiple sets of measurement columns (wide format) into single columns (long format)](http://stackoverflow.com/questions/12466493/reshaping-multiple-sets-of-measurement-columns-wide-format-into-single-columns) – Henrik Jan 31 '17 at 03:45

1 Answer


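tidyr::gather() can't reshape several sets of value columns (here, the means and the standard errors) in a single call. If you want to stick to tidyr, you can do it with gather, separate, and spread: gather everything into one long column, split the column names into the variable and the statistic, then spread the statistic back out into columns: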

library(tidyr)
original_df %>% 
    gather(var_stats, value, -group_dummy_code) %>% 
    separate(var_stats, into = c("var", "stats")) %>% 
    spread(stats, value)

#  group_dummy_code  var mean    se
#1                0 var1  1.5 0.025
#2                0 var2  3.5 0.075
#3                1 var1  2.5 0.050
#4                1 var2  4.5 0.100
Psidom
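
A note for readers on tidyr 1.0.0 or later: gather() and spread() are superseded there, and pivot_longer() can do the same reshape in one call. The sketch below is not part of the answer above. It assumes the column names split cleanly on a single underscore into a variable part and a statistic part; the name long_df is only illustrative. The special ".value" entry in names_to tells pivot_longer() to turn the statistic part of each name ("mean", "se") into its own output column.

```r
library(tidyr)

original_df <- data.frame(group_dummy_code = c(0, 1),
                          var1_mean = c(1.5, 2.5),
                          var1_se = c(.025, .05),
                          var2_mean = c(3.5, 4.5),
                          var2_se = c(.075, .1))

long_df <- pivot_longer(
    original_df,
    cols = -group_dummy_code,          # reshape everything except the group column
    names_to = c("key", ".value"),     # "var1_mean" -> key = "var1", value column "mean"
    names_sep = "_"                    # split column names on the underscore
)
```

The result is a tibble with the columns group_dummy_code, key, mean and se, the same layout as the gather-separate-spread output. If a variable name itself contained an underscore, names_sep = "_" would no longer split correctly and names_pattern would be the safer choice.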