3

I want to compute the mean of several columns for each group, but the columns should be given as a vector of names:

library(data.table)
DT <- data.table(k=c(1,1,2,2,2),v=1:5,w=11:15,key="k")
DT[,list(N=.N,v=mean(v),w=mean(w)),by="k"]
   k N   v    w
1: 1 2 1.5 11.5
2: 2 3 4.0 14.0

However, I don't want to specify v and w explicitly when computing means. I have another variable

mycols <- c("v","w")

which should be used instead of explicit column names.

I tried various versions of

DT[,list(.N,colMeans(.SD[mycols])),by="k"]

and got

Error in `[.data.table`(.SD, mycols) :

I wonder if there is a way to do it...

sds
  • 58,617
  • 29
  • 161
  • 278
  • 1
    You can check [here](http://stackoverflow.com/questions/14937165/using-dynamic-column-names-in-data-table) and [here](http://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia) – akrun Aug 20 '15 at 03:30

1 Answers1

3

We can concatenate .N with the means using .SDcols to choose the columns mycols. We'll also want to use lapply(.SD, mean) instead of colMeans(.SD) as colMeans() is not optimized.

DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols]
#    k N   v    w
# 1: 1 2 1.5 11.5
# 2: 2 3 4.0 14.0

So another example of this would be, if we only want "v" we use mycols[1]

DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols[1]]
#    k N   v
# 1: 1 2 1.5
# 2: 2 3 4.0

To illustrate further, if we add a column z then run the same code from above, then we see that z is not included in the result. This is because it was removed from .SD using .SDcols = mycols.

DT[, z := 21:25]
DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols]
#    k N   v    w
# 1: 1 2 1.5 11.5
# 2: 2 3 4.0 14.0
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245