
Let's say I have a data.frame

sample_df = structure(list(AE = c(148, 1789, 1223, 260, 1825, 37, 1442, 484, 
10, 163, 1834, 254, 445, 837, 721, 1904, 1261, 382, 139, 213), 
    FW = structure(c(1L, 3L, 2L, 3L, 3L, 1L, 2L, 3L, 2L, 2L, 
    3L, 2L, 3L, 2L, 1L, 3L, 1L, 1L, 1L, 3L), .Label = c("LYLR", 
    "OCXG", "BIYX"), class = "factor"), CP = c("WYB/NXO", "HUK/NXO", 
    "HUK/WYB", "HUK/NXO", "WYB/NXO", "HUK/WYB", "HUK/NXO", "HUK/NXO", 
    "WYB/NXO", "HUK/NXO", "WYB/NXO", "HUK/NXO", "HUK/WYB", "WYB/NXO", 
    "HUK/WYB", "WYB/NXO", "WYB/NXO", "HUK/WYB", "WYB/NXO", "WYB/NXO"
    ), SD = c(1, 1, -1, 1, 1, 1, 1, -1, 1, 1, -1, -1, 1, -1, 
    -1, 1, -1, 1, 1, 1)), .Names = c("AE", "FW", "CP", "SD"), row.names = c(NA, -20L), class = "data.frame")

Or in human-readable format:

     AE   FW      CP SD
1   148 LYLR WYB/NXO  1
2  1789 BIYX HUK/NXO  1
3  1223 OCXG HUK/WYB -1
4   260 BIYX HUK/NXO  1
5  1825 BIYX WYB/NXO  1
6    37 LYLR HUK/WYB  1
7  1442 OCXG HUK/NXO  1
8   484 BIYX HUK/NXO -1
9    10 OCXG WYB/NXO  1
10  163 OCXG HUK/NXO  1
11 1834 BIYX WYB/NXO -1
12  254 OCXG HUK/NXO -1
13  445 BIYX HUK/WYB  1
14  837 OCXG WYB/NXO -1
15  721 LYLR HUK/WYB -1
16 1904 BIYX WYB/NXO  1
17 1261 LYLR WYB/NXO -1
18  382 LYLR HUK/WYB  1
19  139 LYLR WYB/NXO  1
20  213 BIYX WYB/NXO  1

Now suppose that for each unique value (fw, cp) of (FW, CP), I would like to get:

  • sum of all values of AE for (FW,CP)=(fw,cp)
  • mean of all values of SD for (FW,CP)=(fw,cp)

In R, one could do something like:

unique_keys  <- unique(sample_df[, c('FW', 'CP')])
slow_version <- function(ind, sample_df, unique_keys){
    index <- which(sample_df$FW == unique_keys$FW[ind] & sample_df$CP == unique_keys$CP[ind])
    c(ind     = ind,
      sum_ae  = sum(sample_df$AE[index]),
      mean_sd = mean(sample_df$SD[index]))
}
intermed_result <- t(sapply(1:nrow(unique_keys), slow_version, 
                                      sample_df = sample_df, 
                                    unique_keys = unique_keys))
colnames(intermed_result) <- c('ind','sum','mean')
result <- data.frame(unique_keys[intermed_result[, 'ind'], ], 
                    'sum' = intermed_result[,'sum'], 
                    'mean' = intermed_result[,'mean'])

but this gets pretty slow as the size of sample_df grows.
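For comparison, the same two summaries can be computed in base R with `aggregate()` plus a `merge()` (a sketch; it avoids the manual indexing above, though it still won't match data.table on large inputs):

```r
# one aggregate per statistic, then join on the grouping columns
sums  <- aggregate(AE ~ FW + CP, data = sample_df, FUN = sum)
means <- aggregate(SD ~ FW + CP, data = sample_df, FUN = mean)
merge(sums, means, by = c("FW", "CP"))
```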

Thanks to this answer, I suspect it is possible to use data.table magic to get the same result much faster. But doing:

library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))} 
sample_dt[,c("col1","col2"):=f(AE, SD), by=.(FW, CP)][]

does not yield the desired result. What is the correct way?

  • You'll run into syntax like `DT[, .(sae = sum(AE), msd = mean(SD)), by=.(FW,CP)]` early in the vignettes. Type `vignette(package="data.table")` to see a listing or look at the website for the package under "getting started". – Frank Apr 09 '17 at 02:32

1 Answer


I would try:

library(data.table)
sample_dt = data.table(sample_df)
setkey(sample_dt, FW, CP)
f <- function(AE, SD) {list('sum' = sum(AE), 'mean' = mean(SD))} 

sample_dt[, f(AE, SD), by=.(FW, CP)]
#      FW      CP  sum       mean
# 1: LYLR HUK/WYB 1140  0.3333333
# 2: LYLR WYB/NXO 1548  0.3333333
# 3: OCXG HUK/NXO 1859  0.3333333
# 4: OCXG HUK/WYB 1223 -1.0000000
# 5: OCXG WYB/NXO  847  0.0000000
# 6: BIYX HUK/NXO 2533  0.3333333
# 7: BIYX HUK/WYB  445  1.0000000
# 8: BIYX WYB/NXO 5776  0.5000000

You didn't get the desired output because `:=` assigns the resulting sum and mean columns by group back to the original data.table, repeating each group's values across its rows, instead of collapsing to one row per group. That said, I also prefer the syntax suggested by Frank, which is the right way to go. For the current named-list approach, adding `verbose = TRUE` prints:

Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
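A quick way to see the difference (a sketch, reusing `sample_dt` and `f` from the question): `:=` keeps all 20 rows and repeats each group's aggregates, while plain grouping collapses to one row per (FW, CP) combination:

```r
# := adds the aggregates as new columns, repeated within each group --
# the table keeps its original 20 rows
sample_dt[, c("col1", "col2") := f(AE, SD), by = .(FW, CP)]
nrow(sample_dt)   # 20

# grouping without := collapses to one row per (FW, CP) group
nrow(sample_dt[, f(AE, SD), by = .(FW, CP)])   # 8
```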

When there are many groups and the functions in j are basic functions like sum and mean, using

sample_dt[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by=.(FW, CP)]

would be very fast, because those functions are internally replaced with GForce functions like gsum and gmean. See ?GForce and the benchmark of Frank for more information.
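To check this on bigger data, here is a rough benchmark sketch (it assumes the `microbenchmark` package is installed; `big` is just the toy table resampled up to a million rows, and `sample_dt` and `f` come from above):

```r
library(microbenchmark)
set.seed(1)
# inflate the toy table so the grouping cost becomes measurable
big <- sample_dt[sample(.N, 1e6, replace = TRUE)]
microbenchmark(
  named_list = big[, f(AE, SD), by = .(FW, CP)],
  gforce     = big[, .(sum.AE = sum(AE), mean.SD = mean(SD)), by = .(FW, CP)],
  times = 10
)
```

With only 8 groups the gap will be modest; it grows with the number of groups, as Frank's benchmark in the comments shows.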

  • This is kind of esoteric but: by wrapping it in a function, you're preventing data.table from optimizing the call/query. See `?GForce` and try running your query with `verbose=TRUE` (which will warn about using a named list by group and show that GForce is not used, even though it could be here). The OP is complaining about speed, so I figure it's worth noting, anyway. – Frank Apr 09 '17 at 02:35
  • @Frank, thanks for pointing that out. It seems that when the size of the data grows, `GForce` would be faster, but the difference is very small for this data. See the benchmark results in my edited post. – mt1022 Apr 09 '17 at 02:54
  • It generally shows an improvement as the number of groups grows. I posted a benchmark in the R chat room: http://chat.stackoverflow.com/rooms/25312/r-public – Frank Apr 09 '17 at 03:11
  • Thanks! The benchmarks made me completely change my original approach! – user189035 Apr 09 '17 at 09:32