data.table: lapply a function with multicolumn output

Question

I'm using the function smean.cl.normal from the Hmisc package, which returns a vector of 3 values: the mean and the lower and upper confidence limits. When I apply it to a data.table with 2 groups, I get 2 columns and 6 rows. Is there a way to get two rows, one per group, with a separate column for each of the function's outputs, i.e. the mean and the CI bounds?

require(Hmisc)
require(data.table)

dt = data.table(x = rnorm(100),
                gr = rep(c('A', 'B'), each = 50))

dt[, lapply(.SD, smean.cl.normal), by = gr, .SDcols = "x"]

The output:

   gr           x
1:  A -0.07916335
2:  A -0.33656667
3:  A  0.17823998
4:  B -0.02745333
5:  B -0.32950607
6:  B  0.27459941

The desired output:

   gr        Mean         Lower         Upper
1:  A -0.07916335   -0.33656667    0.17823998
2:  B -0.02745333   -0.32950607    0.27459941

There is probably a better way, but the following works: `dt2 <- dcast(dt[, lapply(.SD, smean.cl.normal), by = gr], gr ~ rowid(gr)); setnames(dt2, 2:4, c('Mean','Lower','Upper'))` — Jaap, Feb 10 '17 at 16:22
Thanks. Yes, I did it with dcast as well, it's just I thought there's something in `data.table` that I'm missing. — mattek, Feb 10 '17 at 16:25
refer to http://stackoverflow.com/questions/27494813/data-table-computing-several-column-at-once — Mahdi Jadaliha, Feb 10 '17 at 16:45

score 8 · Accepted Answer · edited Feb 10 '17 at 16:48

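The j argument in DT[i,j,by] expects a list when it should produce several columns; an atomic vector is instead stacked as rows of a single column. So wrap the result in as.list: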

dt[,
   Reduce(c, lapply(.SD, function(x) as.list(smean.cl.normal(x)))),
   by = gr, .SDcols = "x"]

#    gr       Mean      Lower     Upper
# 1:  A  0.1032966 -0.1899466 0.3965398
# 2:  B -0.1437617 -0.4261330 0.1386096

c(L1, L2, L3) is how lists are combined, so Reduce(c, List_o_Lists) does the trick in case your .SDcols contains more than just x. I guess do.call(c, List_o_Lists) should also work.
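One difference between the two may matter once .SDcols holds several columns: the resulting names. Here is a small base-R sketch, using a made-up list of lists (`per_col`, with hypothetical values) in place of what lapply(.SD, ...) would return for columns x and y:

```r
# Hypothetical stand-in for lapply(.SD, function(x) as.list(smean.cl.normal(x)))
# with two value columns, x and y (the numbers are invented for illustration).
per_col <- list(
  x = list(Mean = 1, Lower = 0),
  y = list(Mean = 2, Lower = 1)
)

# Reduce() passes the elements to c() without their outer names,
# so the inner names repeat.
via_reduce <- Reduce(c, per_col)
names(via_reduce)

# do.call() passes the outer names as argument names, and c() prefixes
# each inner name with them.
via_docall <- do.call(c, per_col)
names(via_docall)
```

So with more than one column, do.call(c, ...) should give distinguishable column names (x.Mean, y.Mean, ...), while Reduce(c, ...) leaves duplicates that would need renaming afterwards.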

Comments

This is quite inefficient for a couple of reasons. Turn on verbose=TRUE to see that data.table doesn't like getting named lists in j:

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

Also, you are missing out on group-optimized versions of mean and other functions that can probably be used to build your result. This may not be a big deal for your use-case, though.
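A rough sketch of what building the result from simpler functions could look like. It assumes the interval is the usual t-based one, mean ± qt(0.975, n - 1) * sd / sqrt(n), which is my understanding of what smean.cl.normal computes by default; the names `res` and `half` are only for illustration:

```r
library(data.table)

set.seed(1)
dt <- data.table(x = rnorm(100), gr = rep(c("A", "B"), each = 50))

# Step 1: only mean(), sd() and .N in j, which data.table may be able to
# optimize per group.
res <- dt[, .(Mean = mean(x), SD = sd(x), N = .N), by = gr]

# Step 2: vectorized 95% interval, computed once on the small summary table
# (assumed t-based formula, see the note above).
res[, half := qt(0.975, N - 1) * SD / sqrt(N)]
res[, `:=`(Lower = Mean - half, Upper = Mean + half)]

# Drop the helper columns so only gr, Mean, Lower, Upper remain.
res[, c("SD", "N", "half") := NULL]
res
```

Whether this is noticeably faster depends on the number of groups; for two groups it will hardly matter.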

When you're applying this to only a single value column, just:

dt[, as.list(smean.cl.normal(x)), by = gr]

suffices.
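To see why as.list makes the difference, here is a minimal sketch that avoids the Hmisc dependency. `mean_ci` is a hypothetical stand-in that returns a named numeric vector of length 3, like smean.cl.normal does; its formula is an assumption, not copied from Hmisc:

```r
library(data.table)

# Hypothetical stand-in for Hmisc::smean.cl.normal: returns a named
# numeric vector (Mean, Lower, Upper) using a t-based interval.
mean_ci <- function(x, conf = 0.95) {
  n <- length(x)
  half <- qt((1 + conf) / 2, n - 1) * sd(x) / sqrt(n)
  c(Mean = mean(x), Lower = mean(x) - half, Upper = mean(x) + half)
}

set.seed(42)
dt <- data.table(x = rnorm(100), gr = rep(c("A", "B"), each = 50))

# Atomic vector in j: the three values become three rows per group.
long <- dt[, mean_ci(x), by = gr]

# List in j: each element becomes its own column, one row per group.
wide <- dt[, as.list(mean_ci(x)), by = gr]

dim(long)
dim(wide)
```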
