Hello to everyone on SO! I rarely ask new questions since so much has already been said on this forum, but this time I cannot find enough material to get me through my performance issues.
I basically have survey data from which I want to compute various indicators at brand level. The trick is that, for each brand in the loop, I need to recompute my vectors after excluding all observations related to that brand.
At this point I have not found a way to vectorize my code; consequently, my lapply loop is tediously slow (by far the slowest part of a bigger script).
My real dataset is 8 million rows long and I loop over 70 brands, so performance starts to matter. Here is a shorter reproducible example for your own tests:
(EDIT: comments added to the script for better understanding.)
# Packages used in this example
library(data.table)
library(magrittr)   # for the %>% pipe used below
library(expss)      # weighted mean / sd helpers

# Small sample size to experiment
N <- 200L
# Table with survey data:
# - each observation is the answer of a person about a brand
# - each observation is associated to a weight, used to compute statistics (frequencies, means...)
# - each person is described by a few socio-demographic variables (country, gender, age)
# - brands are given a grade ('score' variable), ranging from 0 to 10
repex_DT <- data.table(
  country   = factor(sample(c("COUNTRY 1", "COUNTRY 2", "COUNTRY 3", "COUNTRY 4"), size = N, replace = TRUE)),
  gender    = factor(sample(c("Male", "Female"), size = N, replace = TRUE)),
  age_group = factor(sample(c("Less than 35", "35 and more"), size = N, replace = TRUE)),
  brand     = factor(sample(LETTERS, size = N, replace = TRUE)),
  score     = sample(x = 0:10, size = N, replace = TRUE),
  weight    = sample(x = 2 / (1:10), size = N, replace = TRUE)
)
# The loop computes, for each "country x gender x age_group x brand" combination:
# - the "country x gender x age_group" socio-demographic group size (cases_total, i.e. all brands included)
# - the "country x gender x age_group" group size, excluding the people assessing the current 'brand_' argument
# - Same logic for mean and standard deviation indicators
current_loop <- data.table::rbindlist(l = lapply(unique(repex_DT$brand), function(brand_) {
  # Calculations done for each 'brand_' within each "country x gender x age_group" combination
  out <- repex_DT[, .(
    cases_total  = sum(x = weight, na.rm = TRUE),
    cases_others = sum(x = weight[brand != brand_], na.rm = TRUE),
    mean_others  = expss::w_mean(x = score[brand != brand_], weight = weight[brand != brand_], na.rm = TRUE),
    sd_others    = expss::w_sd(x = score[brand != brand_], weight = weight[brand != brand_], na.rm = TRUE)
  ), by = .(country, gender, age_group)]
  out[, brand := brand_]
  data.table::setcolorder(x = out, neworder = "brand")
  return(data.table::copy(x = out))
})) %>%
  # Sorting at the end for better readability
  .[order(country, gender, age_group, brand)]
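To make the "excluding the current brand" logic explicit, here is a quick sanity check for a single brand (say "A"), written with plain frequency-weighted arithmetic instead of expss. If I understand w_mean correctly, the cases_others and mean_others columns below should match the brand == "A" rows of current_loop:

# Direct computation of the "others" statistics for brand "A" only,
# using sum(w * x) / sum(w) as the weighted mean (my understanding of expss::w_mean)
check_A <- repex_DT[brand != "A",
                    .(cases_others = sum(weight, na.rm = TRUE),
                      mean_others  = sum(weight * score, na.rm = TRUE) / sum(weight, na.rm = TRUE)),
                    by = .(country, gender, age_group)]
head(check_A[order(country, gender, age_group)])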
So far I have read plenty of other SO questions like this one, this other one and others on the same topic, so I am well aware that extending a data.table inside a loop is both memory- and time-consuming. Yet I have not found another way to get where I want. I hope you R experts can :-)
And by the way, I use expss to compute weighted means and standard deviations because I also use that package for tables here and there, but I could certainly switch to other packages if that helps performance-wise.
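In case that helps, here is roughly what I would swap in for the expss calls if plain arithmetic turned out to be faster. This is only a sketch assuming the usual frequency-weighted formulas; in particular I am not sure whether expss divides the weighted variance by sum(w) or by sum(w) - 1, so w_sd_base below may differ slightly from expss::w_sd:

# Hand-rolled weighted mean / sd, assuming frequency weights
w_mean_base <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)      # drop pairs with a missing value or weight
  sum(w[ok] * x[ok]) / sum(w[ok])
}
w_sd_base <- function(x, w) {
  ok <- !is.na(x) & !is.na(w)
  m  <- sum(w[ok] * x[ok]) / sum(w[ok])
  sqrt(sum(w[ok] * (x[ok] - m)^2) / (sum(w[ok]) - 1))
}
# Example: "others" statistics for brand "A" with the hand-rolled helpers
repex_DT[brand != "A",
         .(mean_others = w_mean_base(score, weight),
           sd_others   = w_sd_base(score, weight)),
         by = .(country, gender, age_group)]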