39

I am trying to apply multiple functions to multiple columns of a data.table. Example:

DT <- data.table("a"=1:5,
                 "b"=2:6,
                 "c"=3:7)

Let's say I want to get the mean and the median of columns a and b. This works:

stats <- DT[,.(mean_a=mean(a),
               median_a=median(a),
               mean_b=mean(b),
               median_b=median(b))]

But it is way too repetitive. Is there a nice way to achieve a similar result using .SDcols and lapply?

Henrik
  • 65,555
  • 14
  • 143
  • 159
paljenczy
  • 4,779
  • 8
  • 33
  • 46

5 Answers5

38

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median   b.mean b.median 
#     3        3        4        4 
eddi
  • 49,088
  • 6
  • 104
  • 155
  • 5
    I had a similar idea but thought the OP wanted a data.table output instead of a vector `DT[, as.list(unlist(lapply(.SD, my.summary))), .SDcols = c('a', 'b')]` – akrun Apr 14 '15 at 07:49
  • 10
    You could also probably simplify to `my.summary = function(x) c(mean = mean(x), median = median(x)) ; DT[, sapply(.SD, my.summary), .SDcols = a:b]` – David Arenburg Apr 14 '15 at 07:53
  • 5
    But this seems to be awfully slow if I add a group by category `DT[, as.list(unlist(lapply(.SD, my.summary))), by=category, .SDcols=c('a', 'b') ]` This is taking much longer than doing each summary individually and then joining. Any faster way to do this? I have about 1.5 million groups within the category column @akrun – sriramn Aug 19 '17 at 19:27
  • 2
    Worth mentioning that the output is quite different (long) if adding a `by` grouping! How would you do then? – Matifou May 02 '19 at 01:18
  • 5
    I think the code with `by` would be: `as.list(unlist(lapply(...`? – Matifou May 02 '19 at 05:16
  • @Matifou that works, but why the construct is `as.list(unlist(lapply(...`? – JdeMello Nov 14 '19 at 18:39
  • I think ideally, we would want a table that returns variables in rows and stats in columns. R's current `summary()` is far from readable, so this seems like a highly relevant problem to solve. – altabq Jun 26 '20 at 09:44
  • @altabq `melt`/`dcast` are your friends for that – eddi Jun 29 '20 at 18:53
14

Other answers show how to do it, but no one bothered to explain the basic principle. The basic rule is that elements of lists returned by j expressions form the columns of the resulting data.table. Any j expression that produces a list, each element of which corresponds to a desired column in the result, will work. With this in mind we can use

DT[, c(mean = lapply(.SD, mean),
       median = lapply(.SD, median)),
  .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

or

DT[, unlist(lapply(.SD,
                   function(x) list(mean = mean(x),
                                    median = median(x))),
            recursive = FALSE),
   .SDcols = c('a', 'b')]
##    a.mean a.median b.mean b.median
## 1:      3        3      4        4

depending on the desired order.

Importantly we can use any method we want to produce the desired result, provided only that we arrange the result into a list as described above. For example,

library(matrixStats)
DT[, c(mean = as.list(colMeans(.SD)),
       median = setNames(as.list(colMedians(as.matrix(.SD))), names(.SD))),
   .SDcols = c('a', 'b')]
##    mean.a mean.b median.a median.b
## 1:      3      4        3        4

also works.

thothal
  • 16,690
  • 3
  • 36
  • 71
Ista
  • 10,139
  • 2
  • 37
  • 38
  • 2
    I think the first example does not rename columns, but otherwise this was a very useful answer. Thanks! – Vincent Aug 04 '20 at 13:11
  • Best answer that does not deviate from the principle. I've been always using `.()` for getting multiple outputs but combining this with lapply like `.(lapply(.SD), other = function(col) )` did not work well. I realized using `c(lapply(.SD,func), other = function(col))` was the right approach. – Matthew Son May 14 '21 at 14:54
11

This is a little bit clumsy but does the job with data.table:

funcs = c('median', 'mean', 'sum')

m = DT[, lapply(.SD, function(u){
        sapply(funcs, function(f) do.call(f,list(u)))
     })][, t(.SD)]
colnames(m) = funcs

#  median mean sum
#a      3    3  15
#b      4    4  20
#c      5    5  25
Colonel Beauvel
  • 30,423
  • 11
  • 47
  • 87
  • 1
    Adding a new dependency just for one `t()` call seems to be a bit overhead, why not to use chaining? `m = DT[...][, t(.SD)]`. I think it is also more readable. – jangorecki Apr 14 '15 at 07:28
  • How to include a function that has an additional argument? Example: `quantile(., 0.25)`. – paljenczy Apr 14 '15 at 07:45
  • 1
    Use `Curry` from `functional` package and define `funcs = c(median, sum, Curry(quantile, probs=0.25))`. But you will have to define colnames yourself at this stage. – Colonel Beauvel Apr 14 '15 at 07:48
3

This might be a little over-engineered, but if you come from dplyr's summarize_at() you might want to have a similar structured result.

First define a function lapply_at() which takes a .SD and a character vector of function names as inputs. Then you can easily compute your desired statistics and get a readable result.

library(data.table)
iris_dt <- as.data.table(iris)

lapply_at <- function(var, funs, ...) {
  results <- sapply(var, function(var) {
    lapply(funs, do.call, list(var, ...))
  })
  names(results) <- vapply(names(var), paste, funs, sep = "_", 
                           FUN.VALUE = character(length(funs)),
                           USE.NAMES = FALSE)
  results
}

iris_dt[, lapply_at(.SD, c("mean", "sd"), na.rm = TRUE), 
        .SDcols = patterns("^Sepal"),
        by = Species]

#>       Species Sepal.Length_mean Sepal.Length_sd Sepal.Width_mean
#> 1:     setosa             5.006       0.3524897            3.428
#> 2: versicolor             5.936       0.5161711            2.770
#> 3:  virginica             6.588       0.6358796            2.974
#>    Sepal.Width_sd
#> 1:      0.3790644
#> 2:      0.3137983
#> 3:      0.3224966

Created on 2019-07-03 by the reprex package (v0.2.0).

der_grund
  • 1,898
  • 20
  • 36
2

use dcast

DT$dday <- 1 # add a constant column
dt <- dcast(DT, dday~dday, fun=list(sum, mean), value.var = c('a', 'b'))
# dday a_sum_1 b_sum_1 a_mean_1 b_mean_1
# 1      15      20        3        4

In fact, we can use dcast to implement onehot and feature engineer.