3

I'd like to calculate group means in a data frame and create a new column in the original data frame containing those group mean values. (I'm doing a repeatability study and I want the mean value over measurements within an insertion, unit, and channel in a new column so I can subtract it off and calculate residuals.)

My data:

> head(mytestdata,15)
   Insertion Measurement Unit Channel Value
1          1           1   A5      10  9.41
2          1           1   A5      11  9.51
3          1           1   A5      12 10.59
4          1           1   A5      13  9.45
5          1           2   A5      10  9.42
6          1           2   A5      11  9.03
7          1           2   A5      12 10.62
8          1           2   A5      13  9.39
9          1           3   A5      10  9.38
10         1           3   A5      11  9.87
11         1           3   A5      12 11.34
12         1           3   A5      13  9.59
13         2           1   A5      10 12.10
14         2           1   A5      11 11.28
15         2           1   A5      12 12.95

Specifically, I want to calculate the mean Value per Insertion, Unit, and Channel, and add it to the data frame as meanValue. Then subtract meanValue from Value to get Residual.

Should look like this:

   Insertion Measurement Unit Channel Value meanValue
1          1           1   40      10 11.79     11.56
2          1           1   40      11 11.01     11.38
3          1           1   40      12 10.86     11.19
4          1           1   40      13 10.29     10.91
5          1           2   40      10 11.47     11.56
6          1           2   40      11 11.84     11.38
7          1           2   40      12 11.39     11.19
8          1           2   40      13 11.25     10.91
9          1           3   40      10 11.42     11.56
10         1           3   40      11 11.28     11.38
11         1           3   40      12 11.31     11.19
12         1           3   40      13 11.18     10.91
13         2           1   40      10 10.97     11.55
14         2           1   40      11 11.78     11.87
15         2           1   40      12 11.48     11.25

I know how to get the group means using by, aggregate, etc, which get me a second list or table with the values in it. I'm also confident I could get what I want using some convoluted looping procedures, but I'm looking to stuff them back in the same data frame in an elegant one- or two-line solution, and I figure there's got to be a way to do it but after days of searching I'm not finding it. I don't want a cumbersome solution because I want it to work well when I scale up to lots more data.

M Greenwood
  • 31
  • 1
  • 2

2 Answers2

6

You can use ave to calculate the groupmeans:

df$MeanValue <- with(df, ave(Value, Insertion, Unit, Channel, FUN = mean))

Then calculate the residuals:

df$Residual <- df$Value - df$MeanValue
df
#   Insertion Measurement Unit Channel Value MeanValue     Residual
#1          1           1   A5      10  9.41  9.403333  0.006666667
#2          1           1   A5      11  9.51  9.470000  0.040000000
#3          1           1   A5      12 10.59 10.850000 -0.260000000
#4          1           1   A5      13  9.45  9.476667 -0.026666667
#5          1           2   A5      10  9.42  9.403333  0.016666667
#6          1           2   A5      11  9.03  9.470000 -0.440000000
#7          1           2   A5      12 10.62 10.850000 -0.230000000
#8          1           2   A5      13  9.39  9.476667 -0.086666667
#9          1           3   A5      10  9.38  9.403333 -0.023333333
#10         1           3   A5      11  9.87  9.470000  0.400000000
#11         1           3   A5      12 11.34 10.850000  0.490000000
#12         1           3   A5      13  9.59  9.476667  0.113333333
#13         2           1   A5      10 12.10 12.100000  0.000000000
#14         2           1   A5      11 11.28 11.280000  0.000000000
#15         2           1   A5      12 12.95 12.950000  0.000000000

Or you could use dplyr

library(dplyr)

df %>% group_by(Insertion, Unit, Channel) %>% mutate(MeanValue = mean(Value), Residual = Value - MeanValue)
talat
  • 68,970
  • 21
  • 126
  • 157
2

Using data.table

library(data.table)
setDT(mytestdata)[, c("MeanValue", "Residual") := {m= mean(Value);list(m, Value-m)}, by=list(Insertion, Unit, Channel)]

mytestdata

#        Insertion Measurement Unit Channel Value MeanValue     Residual
#       1:         1           1   A5      10  9.41  9.403333  0.006666667
#       2:         1           1   A5      11  9.51  9.470000  0.040000000
#       3:         1           1   A5      12 10.59 10.850000 -0.260000000
#       4:         1           1   A5      13  9.45  9.476667 -0.026666667
#       5:         1           2   A5      10  9.42  9.403333  0.016666667
#       6:         1           2   A5      11  9.03  9.470000 -0.440000000
#       7:         1           2   A5      12 10.62 10.850000 -0.230000000
#       8:         1           2   A5      13  9.39  9.476667 -0.086666667
#       9:         1           3   A5      10  9.38  9.403333 -0.023333333
#      10:         1           3   A5      11  9.87  9.470000  0.400000000
#      11:         1           3   A5      12 11.34 10.850000  0.490000000
#      12:         1           3   A5      13  9.59  9.476667  0.113333333
#      13:         2           1   A5      10 12.10 12.100000  0.000000000
#      14:         2           1   A5      11 11.28 11.280000  0.000000000
#      15:         2           1   A5      12 12.95 12.950000  0.000000000
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @Arun, Thanks, I did a microbenchmark, Seems like both dplyr and data.table are very close. Any other faster variety? – akrun Jul 31 '14 at 19:42
  • On this data, any benchmark is really meaningless.. :). I'd try benchmarking on about 20 million rows, with different group sizes, say: a total of 1000 groups, 10,000, 100,000, a million groups.. You'll see where and how `data.table` scales! – Arun Jul 31 '14 at 19:45
  • akrun, [here you go](https://gist.github.com/arunsrinivasan/37a051c65275b58861fe). You can play around with it.. – Arun Jul 31 '14 at 20:32
  • @Arun, thanks. Looks like data.table beats dplyr in all the comparisons especially as groups gets bigger :-) – akrun Aug 01 '14 at 05:48
  • @akrun, how did you come up with this part: `{m= mean(Value);list(m, Value-m)}` is there some documentation you saw on it or thought of it yourself? – David Arenburg Aug 01 '14 at 06:35
  • @David Arenburg, From an other solution of me where Arun edited. – akrun Aug 01 '14 at 06:37
  • Interesting, I never thought you can do that in `data.table` – David Arenburg Aug 01 '14 at 06:38
  • @David Arenburg. There is another approach. `setDT(mytestdata)[,Mean.Value:=mean(Value), by=list(Insertion, Unit, Channel)][,Residual:=Value-Mean.Value]` – akrun Aug 01 '14 at 06:43
  • This one I'm familiar with. It is pretty straight forward and this is what I was using until now – David Arenburg Aug 01 '14 at 06:44