R data.table - Apply function A to some columns and function B to some others

Question

I want to aggregate datatable's row, but the aggragation function depends on the name of the column.

For example, if column name is:

variable1 or variable2, then apply the mean() function.
variable3, then apply the max() function.
variable4, then apply the sd() function.

My datatables always have a datetime column: I want to aggregate rows by time. However, the number of "data" column can vary.

I know how to do that with the same aggregation function (e.g. mean()) for all columns:

dt <- dt[, lapply(.SD, mean),
           by = .(datetime = floor_date(datetime, timeStep))]

Or for only a subset of columns:

cols <- c("variable1", "variable2")    
dt <- dt[ ,(cols) := lapply(.SD, mean), 
            by = .(datetime = floor_date(datetime, timeStep)),
            .SDcols = cols]

What I would like to do is something like:

colsToMean <- c("variable1", "variable2") 
colsToMax <- c("variable3")   
colsToSd <- c("variable4")   
dt <- dt[ ,{(colsToMean) := lapply(.SD???, mean),
             (colsToMax) := lapply(.SD???, max),
             (colsToSd) :=  lapply(.SD???, sd)}, 
            by = .(datetime = floor_date(datetime, timeStep)),
            .SDcols = (colsToMean, colsToMax, colsToSd)]

I looked at data.table in R - apply multiple functions to multiple columns which gave me the idea to use a custom function:

myAggregate <- function(x, columnName) {
   FUN = getAggregateFunction(columnName) # Return mean() or max() or sd()
   return FUN(x)
}
dt <- dt[, lapply(.SD, myAggregate, ???columName???),
           by = .(datetime = floor_date(datetime, timeStep))]

But I don't know how to pass the current column name to myAggregate()...

score 6 · Accepted Answer · edited Jun 01 '18 at 11:12

6

Here is one way to do it with Map or mapply:

Let's make some toy data first:

dt <- data.table(
    variable1 = rnorm(100),
    variable2 = rnorm(100),
    variable3 = rnorm(100),
    variable4 = rnorm(100),
    grp = sample(letters[1:5], 100, replace = T)
)

colsToMean <- c("variable1", "variable2") 
colsToMax <- c("variable3")   
colsToSd <- c("variable4")

Then,

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c(mean, max, sd), lengths(scols))

# summary
dt[, Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]

# or replace the original values with summary statistics as in OP
dt[, unlist(scols) := Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]

Another option with GForce on:

scols <- list(colsToMean, colsToMax, colsToSd)
funs <- rep(c('mean', 'max', 'sd'), lengths(scols))

jexp <- paste0('list(', paste0(funs, '(', unlist(scols), ')', collapse = ', '), ')')
dt[, eval(parse(text = jexp)), by = grp, verbose = TRUE]

# Detected that j uses these columns: variable1,variable2,variable3,variable4 
# Finding groups using forderv ... 0.000sec 
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
# Getting back original order ... 0.000sec 
# lapply optimization is on, j unchanged as 'list(mean(variable1), mean(variable2), max(variable3), sd(variable4))'
# GForce optimized j to 'list(gmean(variable1), gmean(variable2), gmax(variable3), gsd(variable4))'
# Making each group and running j (GForce TRUE) ... 0.000sec

edited Jun 01 '18 at 11:12

Frank

66,179
8
96
180

answered May 31 '18 at 14:24

mt1022

16,834
5
48
71

OP mentions aggregation so maybe doesn't want := / overwriting. Also, I guess there should be some way of getting GForce optimization to work here, but can't find an easy way to do it. (See ?GForce if not familiar with it.) – Frank May 31 '18 at 15:46
The solution `dt[, Map(function(f, x) f(x), funs, .SD), by = grp, .SDcols = unlist(scols)]` fits my needs and works like a charm. Thanks! – Blacksad Jun 01 '18 at 06:54
1

@Frank, don't know why but your comment didn't appear in my message inbox. You might forget it but you taught me about data.table optimization and how to examine the execution with `verbose = TRUE` before! I found data.table only recognizes `lapply(.SD, fun)` but not `Map` or `mapply` so that I have to construct an optimizable `j` expression manually. See the edited answer for details. – mt1022 Jun 01 '18 at 07:26
Yeah, i thought so but wasn't sure of my memory, so figured I'd give the ref again. Nice solution! (Just removed what I think was a typo. Please revert if I'm mistaken) – Frank Jun 01 '18 at 11:13
1

@Frank, thanks for the editing. I should have checked it again. – mt1022 Jun 01 '18 at 11:22

R data.table - Apply function A to some columns and function B to some others

1 Answers1

Linked