1

I have a dataframe with multiple factors and multiple numeric vars. I would like to collapse one of the factors (say by mean).

In my attempts I could only think of nested sapply or for loops to isolate the numerical elements to be averaged.

var <- data.frame(A = c(rep('a',8),rep('b',8)), B = 
c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
                  D = rnorm(16), E = rnorm(16))
> var
   A B C           D           E
1  a c e  1.1601720731 -0.57092435
2  a c f -0.0120178626  1.05003748
3  a d e  0.5311032778  1.67867806
4  a d f -0.3399901000  0.01459940
5  a c e -0.2887561691 -0.03847519
6  a c f  0.0004299922 -0.36695879
7  a d e  0.8124655890  0.05444033
8  a d f -0.3777058654  1.34074427
9  b c e  0.7380720821  0.37708543
10 b c f -0.3163496271  0.10921373
11 b d e -0.5543252191  0.35020193
12 b d f -0.5753686426  0.54642790
13 b c e -1.9973216646  0.63597405
14 b c f -0.3728926714 -3.07669300
15 b d e -0.6461596329 -0.61659041
16 b d f -1.7902722068 -1.06761729


sapply(4:ncol(var), function(i){
  sapply(1:length(levels(var$A)), function(j){
    sapply(1:length(levels(var$B)), function(t){
      sapply(1:length(levels(var$C)), function(z){
        mean(var[var$A == levels(var$A)[j] & 
var$B == levels(var$B)[t] & 
var$C == levels(var$C)[z],i])
      })
    })
  })
})

             [,1]       [,2]
[1,]  0.435707952 -0.3046998
[2,] -0.005793935  0.3415393
[3,]  0.671784433  0.8665592
[4,] -0.358847983  0.6776718
[5,] -0.629624791  0.5065297
[6,] -0.344621149 -1.4837396
[7,] -0.600242426 -0.1331942
[8,] -1.182820425 -0.2605947

Is there a way to do this without this many sapply? maybe with mapply or outer

CAOC
  • 323
  • 5
  • 18

3 Answers3

5

Maybe just,

var <- data.frame(A = c(rep('a',8),rep('b',8)), B = 
                    c(rep(c(rep('c',2),rep('d',2)),4)), C = c(rep(c('e','f'),8)),
                  D = rnorm(16), E = rnorm(16))

library(dplyr)
var %>%
  group_by(A,B,C) %>%
  summarise_if(is.numeric,mean)

(Note that the output you show isn't what I get when I run your sapply code, but the above is identical to what I get when I run your sapply's.)

joran
  • 169,992
  • 32
  • 429
  • 468
  • sorry, that's bc I ran it twice without control for the seed. your code does the trick elegantly. Thanks – CAOC Aug 13 '19 at 19:50
4

For inline aggregation (keeping same number of rows of data frame), consider ave:

var$D_mean <- with(var, ave(D, A, B, C, FUN=mean))
var$E_mean <- with(var, ave(E, A, B, C, FUN=mean))

For full aggregation (collapsed to factor groups), consider aggregate:

aggregate(. ~ A + B + C, var, mean)
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 3
    I think you need `aggregate(cbind(D, E) ~ ., FUN = mean, data = var)`. – tmfmnk Aug 13 '19 at 19:44
  • Correct! Did not see the second numeric. – Parfait Aug 13 '19 at 19:45
  • Just a follow up, as I like this solution as well. if I had 100 columns, I'd have to put the `cbind` in all? Is there a way to just consider them all without the `cbind` reference? – CAOC Aug 13 '19 at 20:27
  • Super thanks, @IceCreamToucan and @Parfait !! `aggregate` is actually a lot simpler to use when the output is not a single number but a vector, like the z-score of a vector. – CAOC Aug 18 '19 at 22:55
3

I will complete the holy trinity with a data.table solution. Here .SD is a data.table of all the columns not listed in the by portion. This is a near-dupe of this question (only difference is >1 column being summarized), so click that if you want more solutions.

library(data.table)
setDT(var)

var[, lapply(.SD, mean), by = .(A, B, C)]
#    A B C           D            E
# 1: a c e  0.07465822  0.032976115
# 2: a c f  0.40789460 -0.944631574
# 3: a d e  0.72054938  0.039781185
# 4: a d f -0.12463910  0.003363382
# 5: b c e -1.64343115  0.806838905
# 6: b c f -1.08122890 -0.707975411
# 7: b d e  0.03937829  0.048136471
# 8: b d f -0.43447899  0.028266455
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38