
Let's say I have a data.frame

sample_df = structure(list(AE = c(148, 1789, 1223, 260, 1825, 37, 1442, 484, 
10, 163, 1834, 254, 445, 837, 721, 1904, 1261, 382, 139, 213), 
    FW = structure(c(1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 3L), .Label = c("LYLR", 
    "OCXG", "BIYX"), class = "factor"), CP = c("WYB/NXO", "HUK/NXO", 
    "HUK/WYB", "HUK/NXO", "WYB/NXO", "HUK/WYB", "HUK/NXO", "HUK/NXO", 
    "WYB/NXO", "HUK/NXO", "WYB/NXO", "HUK/NXO", "HUK/WYB", "WYB/NXO", 
    "HUK/WYB", "WYB/NXO", "WYB/NXO", "HUK/WYB", "WYB/NXO", "WYB/NXO"
    ), SD = c(1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 
    -1, 1, -1, 1, 1, 1)), .Names = c("AE", "FW", "CP", "SD"), row.names = c(NA, -20L), class = "data.frame")

Or in human-readable format:

     AE   FW      CP SD
1   148 LYLR WYB/NXO  1
2  1789 BIYX HUK/NXO  1
3  1223 OCXG HUK/WYB -1
4   260 BIYX HUK/NXO  1
5  1825 BIYX WYB/NXO  1
6    37 LYLR HUK/WYB  1
7  1442 OCXG HUK/NXO  1
8   484 BIYX HUK/NXO -1
9    10 OCXG WYB/NXO  1
10  163 OCXG HUK/NXO  1
11 1834 BIYX WYB/NXO -1
12  254 OCXG HUK/NXO -1
13  445 BIYX HUK/WYB  1
14  837 OCXG WYB/NXO -1
15  721 LYLR HUK/WYB -1
16 1904 BIYX WYB/NXO  1
17 1261 LYLR WYB/NXO -1
18  382 LYLR HUK/WYB  1
19  139 LYLR WYB/NXO  1
20  213 BIYX WYB/NXO  1

Now suppose that for each unique value (fw, cp) of (FW, CP), I would like to get:

  • sum of all values of AE for (FW,CP)=(fw,cp)
  • mean of all values of SD for (FW,CP)=(fw,cp)

In R, one could do something like:

unique_keys  <- unique(sample_df[, c('FW', 'CP')])
slow_version <- function(ind, sample_df, unique_keys){
    index <- which(sample_df$FW == unique_keys$FW[ind] & sample_df$CP == unique_keys$CP[ind])
    c(ind     = ind,
      sum_ae  = sum(sample_df$AE[index]),
      mean_sd = mean(sample_df$SD[index]))
}
intermed_result <- t(sapply(1:nrow(unique_keys), slow_version, 
                                      sample_df = sample_df, 
                                    unique_keys = unique_keys))
colnames(intermed_result) <- c('ind','sum','mean')
result <- data.frame(unique_keys[intermed_result[, 'ind'], ], 
                    'sum' = intermed_result[,'sum'], 
                    'mean' = intermed_result[,'mean'])

but this gets pretty slow as the size of sample_df grows.
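For comparison, the same two summaries can be computed in base R with `aggregate()` plus a `merge()` (a sketch; it avoids the manual indexing above, though it still won't match data.table on large inputs):

```r
# one aggregate per statistic, then join on the grouping columns
sums  <- aggregate(AE ~ FW + CP, data = sample_df, FUN = sum)
means <- aggregate(SD ~ FW + CP, data = sample_df, FUN = mean)
merge(sums, means, by = c("FW", "CP"))
```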

Thanks to this answer, I suspect it is possible to use data.table magic to get the same result much faster. But doing:

library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))} 
sample_dt[,c("col1","col2"):=f(AE, SD), by=.(FW, CP)][]

does not yield the desired result. What is the correct way?

  • You'll run into syntax like `DT[, .(sae = sum(AE), msd = mean(SD)), by=.(FW,CP)]` early in the vignettes. Type `vignette(package="data.table")` to see a listing or look at the website for the package under "getting started". – Frank Apr 09 '17 at 02:32

1 Answer


I would try:

library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))} 

sample_dt[, f(AE, SD), by=.(FW, CP)]
#      FW      CP  sum       mean
# 1: LYLR HUK/WYB 1140  0.3333333
# 2: LYLR WYB/NXO 1548  0.3333333
# 3: OCXG HUK/NXO 1859  0.3333333
# 4: OCXG HUK/WYB 1223 -1.0000000
# 5: OCXG WYB/NXO  847  0.0000000
# 6: BIYX HUK/NXO 2533  0.3333333
# 7: BIYX HUK/WYB  445  1.0000000
# 8: BIYX WYB/NXO 5776  0.5000000

You didn't get the desired output because `:=` assigns the resulting sum and mean columns by group back to the original data.table, repeating each group's values across its rows, instead of collapsing to one row per group. That said, I also prefer the syntax suggested by Frank, which is the right way to go. For the current named-list approach, adding `verbose = TRUE` prints:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
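A quick way to see the difference (a sketch, reusing `sample_dt` and `f` from the question): `:=` keeps all 20 rows and repeats each group's aggregates, while plain grouping collapses to one row per (FW, CP) combination:

```r
# := adds the aggregates as new columns, repeated within each group --
# the table keeps its original 20 rows
sample_dt[, c("col1", "col2") := f(AE, SD), by = .(FW, CP)]
nrow(sample_dt)   # 20

# grouping without := collapses to one row per (FW, CP) group
nrow(sample_dt[, f(AE, SD), by = .(FW, CP)])   # 8
```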

When there are many groups and the functions in j are basic functions like sum and mean, using

sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by=.(FW, CP)]

would be very fast, because those functions are internally replaced with GForce functions like gsum and gmean. See ?GForce and the benchmark of Frank for more information.
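To check this on bigger data, here is a rough benchmark sketch (it assumes the `microbenchmark` package is installed; `big` is just the toy table resampled up to a million rows, and `sample_dt` and `f` come from above):

```r
library(microbenchmark)
set.seed(1)
# inflate the toy table so the grouping cost becomes measurable
big <- sample_dt[sample(.N, 1e6, replace = TRUE)]
microbenchmark(
  named_list = big[, f(AE, SD), by = .(FW, CP)],
  gforce     = big[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by = .(FW, CP)],
  times = 10
)
```

With only 8 groups the gap will be modest; it grows with the number of groups, as Frank's benchmark in the comments shows.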

  • This is kind of esoteric but: by wrapping it in a function, you're preventing data.table from optimizing the call/query. See `?GForce` and try running your query with `verbose=TRUE` (which will warn about using a named list by group and show that GForce is not used, even though it could be here). The OP is complaining about speed, so I figure it's worth noting, anyway. – Frank Apr 09 '17 at 02:35
  • @Frank, thanks for pointing that out. It seems that when the size of the data grows, `GForce` would be faster, but the difference is very small for this data. See the benchmark results in my edited post. – mt1022 Apr 09 '17 at 02:54
  • It generally shows an improvement as the number of groups grows. I posted a benchmark in the R chat room: http://chat.stackoverflow.com/rooms/25312/r-public – Frank Apr 09 '17 at 03:11
  • Thanks! The benchmarks made me completely change my original approach! – user189035 Apr 09 '17 at 09:32