Mean of list of columns by group

Question

I want to compute the mean of several columns for each group, but the columns should be given as a vector of names:

library(data.table)
DT <- data.table(k=c(1,1,2,2,2),v=1:5,w=11:15,key="k")
DT[,list(N=.N,v=mean(v),w=mean(w)),by="k"]
   k N   v    w
1: 1 2 1.5 11.5
2: 2 3 4.0 14.0

However, I don't want to specify v and w explicitly when computing means. I have another variable

mycols <- c("v","w")

which should be used instead of explicit column names.

I tried various versions of

DT[,list(.N,colMeans(.SD[mycols])),by="k"]

and got

Error in `[.data.table`(.SD, mycols) :

I wonder if there is a way to do it...

You can check [here](http://stackoverflow.com/questions/14937165/using-dynamic-column-names-in-data-table) and [here](http://stackoverflow.com/questions/24833247/how-can-one-work-fully-generically-in-data-table-in-r-with-column-names-in-varia) — akrun, Aug 20 '15 at 03:30

Rich Scriven · Accepted Answer · 2015-08-20T17:28:32.713


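We can combine .N with the column means, using .SDcols to restrict .SD to the columns named in mycols. It is also better to use lapply(.SD, mean) than colMeans(.SD): lapply() returns a list, which data.table turns into one result column per element, and colMeans() is not optimized.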

DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols]
#    k N   v    w
# 1: 1 2 1.5 11.5
# 2: 2 3 4.0 14.0
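
To see why the list form matters, here is a minimal sketch that rebuilds the j expression by hand for a single group, outside of the grouped call. The names sd1, j_list and j_vec are made up for illustration, and sd1 only approximates what .SD holds for the k == 1 group.

```r
library(data.table)

DT <- data.table(k = c(1, 1, 2, 2, 2), v = 1:5, w = 11:15, key = "k")
mycols <- c("v", "w")

# Rows of the k == 1 group, restricted to mycols (hypothetical stand-in for .SD).
sd1 <- DT[k == 1, mycols, with = FALSE]

# lapply() returns a list, so c() yields a list with one element per output column.
j_list <- c(N = nrow(sd1), lapply(sd1, mean))
str(j_list)

# colMeans() returns a named numeric vector, so c() yields one atomic vector instead.
j_vec <- c(N = nrow(sd1), colMeans(sd1))
str(j_vec)
```

If colMeans() is preferred anyway, wrapping it as as.list(colMeans(.SD)) should give the same list shape as the lapply() version.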

As another example, if we only want "v", we can use mycols[1]:

DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols[1]]
#    k N   v
# 1: 1 2 1.5
# 2: 2 3 4.0

To illustrate further, if we add a column z and run the same code as above, z is not included in the result. This is because .SDcols = mycols restricts .SD to the columns in mycols, so z never enters .SD.

DT[, z := 21:25]
DT[, c(N = .N, lapply(.SD, mean)), by = k, .SDcols = mycols]
#    k N   v    w
# 1: 1 2 1.5 11.5
# 2: 2 3 4.0 14.0
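
If the column names arrive as a variable in several places, the same call can be wrapped in a small function. This is only a sketch: group_means and its arguments dt, cols and by_cols are hypothetical names, not part of data.table.

```r
library(data.table)

# Hypothetical helper: per-group row count plus the mean of each column in `cols`.
# `cols` and `by_cols` are character vectors of column names.
group_means <- function(dt, cols, by_cols) {
  dt[, c(N = .N, lapply(.SD, mean)), by = by_cols, .SDcols = cols]
}

DT <- data.table(k = c(1, 1, 2, 2, 2), v = 1:5, w = 11:15, key = "k")
DT[, z := 21:25]          # extra column that should not be averaged
mycols <- c("v", "w")

res <- group_means(DT, mycols, "k")
print(res)
names(res)                # z is absent because .SDcols excluded it
```

Passing the grouping column as a string keeps the helper fully driven by variables, which matches the goal in the question.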

edited Aug 20 '15 at 17:28

answered Aug 20 '15 at 03:20


I have other columns (besides `v` & `w`) which I do not want to include. – sds Aug 20 '15 at 03:51
@sds - Right. That is what `.SDcols` is for. It chooses the columns for `.SD`. If we add the column `DT$z <- 21:25` then run the code above we can see that `z` will not be included in the result nor the calculation – Rich Scriven Aug 20 '15 at 03:54
Added that example into my answer – Rich Scriven Aug 20 '15 at 03:58
thanks - how is `.(N=.N)` different from `N=.N`? – sds Aug 20 '15 at 13:16
thanks; it appears to be unnecessary though in this case. – sds Aug 20 '15 at 17:26
@sds - ah yes, you're right. Not in this case. Disregard that :) – Rich Scriven Aug 20 '15 at 17:27
