Apply multiple functions to multiple columns in data.table

Question

I am trying to apply multiple functions to multiple columns of a data.table. Example:

DT <- data.table("a"=1:5,
                 "b"=2:6,
                 "c"=3:7)

Let's say I want to get the mean and the median of columns a and b. This works:

stats <- DT[,.(mean_a=mean(a),
               median_a=median(a),
               mean_b=mean(b),
               median_b=median(b))]

But it is way too repetitive. Is there a nice way to achieve a similar result using .SDcols and lapply?

Why not put the functions into a custom function and call that? — A5C1D2H2I1M1N2O1R2T1, Apr 14 '15 at 06:48
Or, maybe look at the development version of "data.table" where `dcast` can handle multiple column aggregations at once. — A5C1D2H2I1M1N2O1R2T1, Apr 14 '15 at 06:49
This may be easier using `dplyr` `summarise_each(DT,funs(mean, median), 1:2)` — akrun, Apr 14 '15 at 06:50
This'll be better when [`colwise()`](https://github.com/Rdatatable/data.table/issues/1063) is implemented. — Arun, Apr 14 '15 at 09:42

score 38 · Accepted Answer · answered Apr 14 '15 at 07:46

38

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median   b.mean b.median 
#     3        3        4        4

answered Apr 14 '15 at 07:46

eddi

49,088
6
104
155

5

I had a similar idea but thought the OP wanted a data.table output instead of a vector `DT[, as.list(unlist(lapply(.SD, my.summary))), .SDcols = c('a', 'b')]` – akrun Apr 14 '15 at 07:49
10

You could also probably simplify to `my.summary = function(x) c(mean = mean(x), median = median(x)) ; DT[, sapply(.SD, my.summary), .SDcols = a:b]` – David Arenburg Apr 14 '15 at 07:53
5

But this seems to be awfully slow if I add a group by category `DT[, as.list(unlist(lapply(.SD, my.summary))), by=category, .SDcols=c('a', 'b') ]` This is taking much longer than doing each summary individually and then joining. Any faster way to do this? I have about 1.5 million groups within the category column @akrun – sriramn Aug 19 '17 at 19:27
2

Worth mentioning that the output is quite different (long) if adding a `by` grouping! How would you do then? – Matifou May 02 '19 at 01:18
5

I think the code with `by` would be: `as.list(unlist(lapply(...`? – Matifou May 02 '19 at 05:16
@Matifou that works, but why the construct is `as.list(unlist(lapply(...`? – JdeMello Nov 14 '19 at 18:39
I think ideally, we would want a table that returns variables in rows and stats in columns. R's current `summary()` is far from readable, so this seems like a highly relevant problem to solve. – altabq Jun 26 '20 at 09:44
@altabq `melt`/`dcast` are your friends for that – eddi Jun 29 '20 at 18:53

score 14 · Answer 2 · edited Aug 07 '19 at 07:02

Other answers show how to do it, but no one bothered to explain the basic principle. The basic rule is that elements of lists returned by j expressions form the columns of the resulting data.table. Any j expression that produces a list, each element of which corresponds to a desired column in the result, will work. With this in mind we can use

DT[, c(mean = lapply(.SD, mean),
       median = lapply(.SD, median)),
  .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

or

DT[, unlist(lapply(.SD,
                   function(x) list(mean = mean(x),
                                    median = median(x))),
            recursive = FALSE),
   .SDcols = c('a', 'b')]
##    a.mean a.median b.mean b.median
## 1:      3        3      4        4

depending on the desired order.

Importantly we can use any method we want to produce the desired result, provided only that we arrange the result into a list as described above. For example,

library(matrixStats)
DT[, c(mean = as.list(colMeans(.SD)),
       median = setNames(as.list(colMedians(as.matrix(.SD))), names(.SD))),
   .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

also works.

I think the first example does not rename columns, but otherwise this was a very useful answer. Thanks! — Vincent, Aug 04 '20 at 13:11
Best answer that does not deviate from the principle. I've been always using `.()` for getting multiple outputs but combining this with lapply like `.(lapply(.SD), other = function(col) )` did not work well. I realized using `c(lapply(.SD,func), other = function(col))` was the right approach. — Matthew Son, May 14 '21 at 14:54

Colonel Beauvel · Answer 3 · 2015-04-14T07:41:09.577

11

This is a little bit clumsy but does the job with data.table:

funcs = c('median', 'mean', 'sum')

m = DT[, lapply(.SD, function(u){
        sapply(funcs, function(f) do.call(f,list(u)))
     })][, t(.SD)]
colnames(m) = funcs

#  median mean sum
#a      3    3  15
#b      4    4  20
#c      5    5  25

edited Apr 14 '15 at 07:41

answered Apr 14 '15 at 07:09

Colonel Beauvel

30,423
11
47
87

1

Adding a new dependency just for one `t()` call seems to be a bit overhead, why not to use chaining? `m = DT[...][, t(.SD)]`. I think it is also more readable. – jangorecki Apr 14 '15 at 07:28
How to include a function that has an additional argument? Example: `quantile(., 0.25)`. – paljenczy Apr 14 '15 at 07:45
1

Use `Curry` from `functional` package and define `funcs = c(median, sum, Curry(quantile, probs=0.25))`. But you will have to define colnames yourself at this stage. – Colonel Beauvel Apr 14 '15 at 07:48

score 3 · Answer 4 · answered Jul 03 '19 at 21:01

This might be a little over-engineered, but if you come from dplyr's summarize_at() you might want to have a similar structured result.

First define a function lapply_at() which takes a .SD and a character vector of function names as inputs. Then you can easily compute your desired statistics and get a readable result.

library(data.table)
iris_dt <- as.data.table(iris)

lapply_at <- function(var, funs, ...) {
  results <- sapply(var, function(var) {
    lapply(funs, do.call, list(var, ...))
  })
  names(results) <- vapply(names(var), paste, funs, sep = "_", 
                           FUN.VALUE = character(length(funs)),
                           USE.NAMES = FALSE)
  results
}

iris_dt[, lapply_at(.SD, c("mean", "sd"), na.rm = TRUE), 
        .SDcols = patterns("^Sepal"),
        by = Species]

#>       Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean
#> 1:     setosa             5.006       0.3524897            3.428
#> 2: versicolor             5.936       0.5161711            2.770
#> 3:  virginica             6.588       0.6358796            2.974
#>    Sepal.Width_sd
#> 1:      0.3790644
#> 2:      0.3137983
#> 3:      0.3224966

Created on 2019-07-03 by the reprex package (v0.2.0).

score 2 · Answer 5 · answered Dec 18 '18 at 07:50

use dcast

DT$dday <- 1 # add a constant column
dt <- dcast(DT, dday~dday, fun=list(sum, mean), value.var = c('a', 'b'))
# dday a_sum_1 b_sum_1 a_mean_1 b_mean_1
# 1      15      20        3        4

In fact, we can use dcast to implement onehot and feature engineer.

Apply multiple functions to multiple columns in data.table

5 Answers5

Linked

Related