
Hello to everyone on SO! I rarely ask new questions since so much has already been said on this forum, but this time I cannot find enough material to get me through my performance issues.

I basically have survey data from which I want to compute various indicators at brand level. The trick is that I need to create variations of my vectors for each element of the loop, excluding all observations related to the i-th brand being tested. So far I have not found a way to vectorize my code, and consequently my lapply loop is tediously slow (by far the slowest part of a bigger script).

My dataset is 8 million rows long and I loop over 70 brands, so performance starts to matter at this point. See the shorter reproducible example below for your own tests:

(EDIT: comments added to the script for better understanding.)

# Small sample size to experiment
N <- 200L 

# Table with survey data :
# - each observation is the answer of a person about a brand
# - each observation is associated to a weight, used to compute statistics (frequencies, means...)
# - each person is described by a few socio-demographic variables (country, gender, age)
# - brands are given a grade ('score' variable), ranging from 0 to 10
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(c(toupper(letters[1:26])), size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# The loop computes for each "country x gender x age_group x brand" combination :
# - the "country x gender x age_group" socio-demographic group size (cases_total, i.e. all brands included)
# - the "country x gender x age_group" group size, excluding the people assessing the current 'brand_' argument
# - Same logic for mean and standard deviation indicators

current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]

So far I have read plenty of other SO questions like this one, this other one and others on the same topic, so I am well aware that loops extending a data.table are both memory and time consuming. Yet I haven't found another way to get me where I want. Hope you R experts can :-) And by the way, I use expss to compute weighted means and standard deviations because I also use that package to compute tables here and there, but I certainly could use other packages if that could help performance-wise.

Maxence Dum.
  • Could you explain in words what your code is doing? I'd also recommend re-doing your sample data creation to parameterize `N <- 500000L` - it would be nice to be able to easily experiment with a small example before jumping to a large one. – Gregor Thomas Mar 06 '22 at 17:37
  • And have you profiled your code at all? Within the loop, what are the most expensive operations? I just ran a quick comparison and it looks like `expss::w_mean` is about 3 times slower than the built-in `stats::weighted.mean()` (tested on a vector of length 1M), but before profiling we can't know if the mean and SD calculations are a bottleneck--optimizing them would be pretty pointless if they make up less than 1% of the compute time. – Gregor Thomas Mar 06 '22 at 17:42
  • @GregorThomas Sure, I will edit and add a bit of context. Regarding your second comment, I compute my statistics on hundreds of thousands of records and it runs smoothly; at least performance is not an issue. The same operations inside a loop, as you see above, and it's about a minute looking at the screen waiting for the calculations to end. – Maxence Dum. Mar 06 '22 at 17:48
  • But it looks like you're doing fairly simple leave-one-brand-out mean and sd calculations, which can definitely be optimized. For every single brand, you add up all the other brands - you could instead do group sums for each brand one time and subtract each brand's total from the grand total to know the leave-one-out sum. Do that for both `score` and `score^2` (and the weights) to make an aggregate table with one row per brand, and you should be able to calculate means and SDs fairly easily. I don't have time to develop an answer now, but if help is still needed later I'll try to take a look. – Gregor Thomas Mar 06 '22 at 17:51
  • I was curious and just ran a comparison between `sqrt(Hmisc::wtd.var())` and `expss::w_sd` using `n = 1e6; x = rnorm(n); w = sample(1:500, n, replace = T)` with code `bench::mark( sqrt(wtd.var(x, w)), w_sd(x, w))`. The `Hmisc` version is about 6-7 times faster and uses 1/3 of the memory of the `expss` version. So as a very easy first step you can swap out `w_mean` for `weighted.mean` and `w_sd` for `sqrt(Hmisc::wtd.var())`. – Gregor Thomas Mar 06 '22 at 18:48
  • Did you try `options(datatable.optimize = 1L)` to turn on data.table's built-in optimizations? I tried it for N = 200L, and it gives: user/system/elapsed 0.426/0.000/0.426 (0L, OFF) vs 0.276/0.000/0.277 (1L, ON). – BrunoPLC Mar 06 '22 at 20:34
  • Thank you both for your input! @GregorThomas I have tested your suggestions with microbenchmark and a bigger sample size (`N <- 500000L`); it sure helps, but I wouldn't say that was the bottleneck of the code. Results of microbenchmark (times=100): `Unit: seconds expr mean median max neval current_loop 3.103874 3.075840 3.990417 100 optimized_loop 2.678985 2.613215 3.465028 100` @BrunoPLC I will give this DT option, which I did not know about, a try. – Maxence Dum. Mar 07 '22 at 07:48
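
For reference, here is a minimal sketch of the `optimized_loop` variant benchmarked in the last comment, i.e. the original loop with the expss helpers swapped for `stats::weighted.mean()` and `sqrt(Hmisc::wtd.var())` as suggested above. The exact code is an assumption, and weighted-variance definitions can differ slightly between packages:

optimized_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  out <- repex_DT[ , .(
    cases_total  = sum(weight, na.rm=TRUE),
    cases_others = sum(weight[brand != brand_], na.rm=TRUE),
    # stats::weighted.mean() instead of expss::w_mean()
    mean_others  = stats::weighted.mean(x=score[brand != brand_],
                                        w=weight[brand != brand_], na.rm=TRUE),
    # sqrt(Hmisc::wtd.var()) instead of expss::w_sd()
    sd_others    = sqrt(Hmisc::wtd.var(x=score[brand != brand_],
                                       weights=weight[brand != brand_], na.rm=TRUE))
  ), by = .(country, gender, age_group)]
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  out}))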

2 Answers


While not addressing the vectorization issue, using the collapse package can be more efficient and lead to a nice speedup (YMMV, depending on the number of available cores):

invisible(suppressPackageStartupMessages(
    lapply(c("magrittr", "expss", "data.table", "collapse"), require, character.only=TRUE)))
options(datatable.optimize = 3L)
N <- 1E7
repex_DT <- data.table (
  country = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace=TRUE)),
  gender = factor(sample(c("Male", "Female"), size = N, replace=TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace=TRUE)),
  brand = factor(sample(LETTERS, size = N, replace=TRUE)),
  score = sample(x = c(0:10), size = N, replace=TRUE),
  weight = sample(x = c(2/(1:10)), size = N, replace=TRUE)
)

# your version
system.time({
    current_loop <- data.table::rbindlist(l=lapply(unique(repex_DT$brand), function(brand_){
  # Calculations done for each 'brand_' of each "country x gender x age_group" combination
  out <- repex_DT[ , .(
    cases_total = sum(x=weight, na.rm=TRUE),
    cases_others = sum(x=weight[brand != brand_], na.rm=TRUE),
    mean_others = expss::w_mean(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE),
    sd_others  = expss::w_sd(x=score[brand != brand_], weight=weight[brand != brand_], na.rm=TRUE)
  ), by = .(country, gender, age_group)]
  
  out[, brand := brand_]
  data.table::setcolorder(x=out, neworder="brand")
  return(data.table::copy(x=out))})) %>% 
  
  # Sorting at the end for better readability
  .[order(., country, gender, age_group, brand)]
})
#>    user  system elapsed 
#>  95.612   1.557  49.309

# version using collapse
system.time({
cols <- c("country", "gender", "age_group")
ot <- repex_DT[ , .(cases_total = sum(weight, na.rm=TRUE)), by = cols]
ot2 <- data.table::setorder(rbindlist(lapply(setNames(levels(repex_DT$brand), levels(repex_DT$brand)), 
    \(i) repex_DT[brand != i][, .(cases_others = fsum(x=weight, na.rm=TRUE),
    mean_others = fmean(score, w=weight, na.rm=TRUE),
    sd_others = fsd(score, w=weight, na.rm=TRUE)), by = cols]), 
    idcol="brand"), country, gender, age_group, brand)
out <- data.table::setcolorder(ot[ot2, on=cols], "brand")
out[, brand := factor(brand)]
})
#>    user  system elapsed 
#>  49.836   3.478  11.543

all.equal(current_loop, out)
#> [1] TRUE

Created on 2022-03-07 by the reprex package (v2.0.1)
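
For readers who have not used collapse before, a minimal illustration of the weighted helpers relied on above - `fsum()`, `fmean()` and `fsd()` all accept a weight vector through their `w` argument (toy values, not taken from the benchmark):

x <- c(1, 2, 3)
w <- c(1, 1, 2)
collapse::fsum(x, w = w)    # weighted sum, i.e. sum(x * w)
collapse::fmean(x, w = w)   # weighted mean, i.e. sum(x * w) / sum(w)
collapse::fsd(x, w = w)     # weighted standard deviation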

user12728748
  • Thank you for your suggestion, which seems promising. I'm not very familiar with the collapse package and definitely need to investigate, since the time is cut in half using the same lapply loop. However, I cannot run your version; I get errors about unexpected symbols. Maybe because of the `\(i)` loop index? – Maxence Dum. Mar 07 '22 at 08:36
  • Sorry, just replace `\(i)` with `function(i)` if you are using an R version < R 4.1. – user12728748 Mar 07 '22 at 08:51
  • I tried your suggestion, and indeed `collapse` is faster than both the `expss` and `Hmisc` functions. I cannot accept the answer since @langtang gave the vectorized approach I was desperate to find (which is incredibly faster), but I tweaked it even further by using collapse functions, so you deserve your part in this achievement. Thanks! – Maxence Dum. Mar 07 '22 at 18:07
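
As a side note on the `\(i)` exchange above: `\(i)` is simply the anonymous-function shorthand introduced in R 4.1.0, so the two forms below are equivalent (toy example):

sapply(1:3, \(i) i^2)          # R >= 4.1 lambda shorthand
sapply(1:3, function(i) i^2)   # works on any R version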

Here is a much faster approach:

  1. Get the n, mean, and sd for the total:
cases_total = repex_DT[, .(cases_total = sum(weight, na.rm=T),
                           mean_total = expss::w_mean(score, weight, na.rm=T),
                           sd_total = expss::w_sd(score, weight, na.rm=T)), .(country, gender, age_group)]
  2. Get the n, mean, and sd for each brand:
cases_brand = repex_DT[, .(cases_brand = sum(weight, na.rm=T),
                           mean_brand = expss::w_mean(score, weight, na.rm=T),
                           sd_brand = expss::w_sd(score, weight, na.rm=T)), .(brand,country, gender, age_group)]
  3. Merge these together:
result = cases_brand[cases_total, on=.(country, gender, age_group)][, cases_others:=cases_total-cases_brand]
  4. It is then very easy to get the mean for the "others" (i.e. non-brand):
result[, mean_others:= (cases_total*mean_total - cases_brand*mean_brand)/cases_others]
  5. Now, a little function to get the sd of the "others". It inverts the standard decomposition of the total (weighted) variance into within-group and between-group components, solving for the variance of the "others" group:
sd_other <- function(n1,n2,total_sd,sub_sd,total_mean, sub_mean ,other_mean) {
  sqrt(
    (total_sd^2*(n1+n2-1) - (n1-1)*sub_sd^2 - n1*(sub_mean-total_mean)^2 - n2*(other_mean-total_mean)^2)/(n2-1)
  )
}
  6. Apply that function to get the sd of the others:
result[, sd_others:= sd_other(cases_brand, cases_others,sd_total,sd_brand,mean_total,mean_brand, mean_others)]
  7. Drop unnecessary columns and set the order:
result[, `:=`(cases_brand=NULL, mean_brand=NULL, sd_brand=NULL, mean_total=NULL, sd_total=NULL)]
setorder(result, country, gender, age_group, brand)
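
To sanity-check the vectorized result against the original loop, one could compare the shared columns - a sketch only, assuming `current_loop` and `result` were built from the same `repex_DT`; the `brand` column's type and levels may need aligning first:

cols_check <- c("country", "gender", "age_group", "brand",
                "cases_total", "cases_others", "mean_others", "sd_others")
all.equal(as.data.frame(current_loop)[, cols_check],
          as.data.frame(result)[, cols_check],
          check.attributes = FALSE)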

Comparison:

> microbenchmark::microbenchmark(list=cmds, times=10)
Unit: milliseconds
         expr       min        lq      mean    median        uq       max neval
 current_loop 3684.4233 3700.1134 3775.4322 3729.8387 3855.5353 3938.4605    10
 new_approach  155.9486  158.2265  164.1699  165.9736  167.5279  174.0746    10
langtang
  • You nailed it 100%! I tried this vectorized approach and, unsurprisingly, it is way faster than the other loop-based suggestions (which were still faster than mine, though). Thanks a lot! – Maxence Dum. Mar 07 '22 at 18:03