-2

I am new in R and I guess this should be easy to do. I have a large data, where for each group in every period there might be two or three variable to estimate. The data looks like this:

df <- data.frame(
 group    = c(1, 1, 1, 1, 1),
 period   = c(1, 1, 1, 2, 2),
 term     = c("Inv", "Not", "Clue", "Mix", "Clue"),
 estimate = c(-1.2, -.85, -.35, -1, -.6),
 pvalue   = c(.001, .01, .00001, .0001, 001)
)

  group period term estimate pvalue
1     1      1  Inv    -1.20  1e-03
2     1      1  Not    -0.85  1e-02
3     1      1 Clue    -0.35  1e-05
4     1      2  Mix    -1.00  1e-04
5     1      2 Clue    -0.60  1e+00

Now I need to summerasie all information in one line for every round of each group. It needs to be in a way that rows spread over columns mentioning their estimate if it exists, and NaN if it does not. It looks like this:

newdf <- data_frame(
 group         = c(1, 1),
 period        = c(1, 2),
 Inv.estimate  = c(-1.2, NaN),
 Not.estimate  = c(-.85, NaN),
 Clue.estimate = c(-.35, -.6),
 Mix.estimate  = c(NaN, -1),
 Inv.pvalue    = c(.001, NaN),
 Not.pvalue    = c(.01, NaN),
 Clue.pvalue   = c(.00001, .001),
 Mix.pvalue    = c(NaN, .001)
)

  group period Inv.estimate Not.estimate Clue.estimate Mix.estimate Inv.pvalue Not.pvalue Clue.pvalue
1     1      1         -1.2        -0.85         -0.35          NaN      0.001       0.01       1e-05
2     1      2          NaN          NaN         -0.60           -1        NaN        NaN       1e-03
  Mix.pvalue
1        NaN
2      0.001

Is there a simple way to do this?

jazzurro
  • 23,179
  • 35
  • 66
  • 76
Homayoon
  • 45
  • 1
  • 7

2 Answers2

2

In addition to the dplyr method, we could also use dcast from data.table which can take multiple value.var columns. We convert the 'data.frame' to 'data.table' (setDT(df)) and use dcast to convert from 'long' to 'wide' format.

library(data.table)
dcast(setDT(df), group+period~term, 
         value.var= c("estimate", "pvalue"), sep=".")
#      group period estimate.Clue estimate.Inv estimate.Mix estimate.Not pvalue.Clue pvalue.Inv pvalue.Mix pvalue.Not
#1:     1      1         -0.35         -1.2           NA        -0.85     0.00001      0.001         NA       0.01
#2:     1      2         -0.60           NA           -1           NA     1.00000         NA     0.0001         NA
akrun
  • 874,273
  • 37
  • 540
  • 662
1

You can use the dplyr and tidyr packages for this. First, I used gather() to put the data in a long format. You have the values of estimate and pvalue in one column. You also have the two variable names in one column, which I named foo. Then, I used mutate() to create column names as you indicated. I remove term and foo since they are no longer necessary. Finally, I used spread() to reshape the data into a wide format.

library(dplyr)
library(tidyr)

gather(df, foo, value, -c(group, period, term)) %>%
mutate(whatever = paste(term, foo, sep = ".")) %>%
select(-term, -foo) %>%
spread(whatever, value)

  group period Clue.estimate Clue.pvalue Inv.estimate Inv.pvalue Mix.estimate Mix.pvalue Not.estimate
1     1      1         -0.35       1e-05         -1.2      0.001           NA         NA        -0.85
2     1      2         -0.60       1e+00           NA         NA           -1      1e-04           NA
  Not.pvalue
1       0.01
2         NA
jazzurro
  • 23,179
  • 35
  • 66
  • 76