I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward. I'm running R version 4.0.3 on Windows.
- with combn, create a vector of unique combinations (m = 3, 4, or 5) from the df variable names (i.e., var1*var2*var3 ... var1*var2*varN); a small toy illustration follows this list
- transform the vector of combinations into a list of formulas
- split the list of formulas into chunks to get around memory limitations (required to run step 4)
- iterate through each chunk from step 3 and perform the formula operation on the df; save the values resulting from the operations in a separate list (ops_list_temp) to use in step 5
- for each element in ops_list_temp, find the indices of the largest n values based on the user-specified topn and save the results to indices_list
- for each element in indices_list, subset df by the indices in that indices_list element and store the corresponding value in values_list
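A toy sketch of what the step 1 combination strings look like (assuming just three variable names and m = 2, purely for illustration; the real run uses the 90 df column names and m = 4):

combn(c("var1", "var2", "var3"), 2, FUN = paste, collapse = "*")
## [1] "var1*var2" "var1*var3" "var2*var3"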
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table following the link below, but I was unable to figure out how to transform the list of formulas into expressions that could be fed to quote(:=(...)):
Apply a list of formulas to R data.table
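For reference, a minimal sketch of the kind of data.table approach I had in mind (an untested assumption on my part, not part of my actual pipeline): computing the element-wise product of one combination of columns with := and Reduce over .SD, which sidesteps building a quoted :=(...) call.

library(data.table)
dt <- as.data.table(df)                 # df as defined in the reprex below
combo <- c("X1", "X2", "X3", "X4")      # hypothetical combination; the reprex names its columns X1..X90
new_col <- paste(combo, collapse = "*")
## element-wise product of the selected columns, stored under the combination name
dt[, (new_col) := Reduce(`*`, .SD), .SDcols = combo]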
It appears to me that one of the bottlenecks in my code is the variable operation step (STEP 4). With m = 4 and 90 variables, there are a total of 2,555,190 combinations (RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE)). Breaking this up into chunks of 10,000 to get around memory limitations results in a list of 256 elements.
With m = 5, there are 43,949,268 combinations (RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE)) and a chunks list of ~4,395 elements.
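The chunk counts follow directly from those totals:

## implied number of 10,000-element chunks
ceiling(RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE) / 10000)  ## 256
ceiling(RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE) / 10000)  ## 4395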
A previous bottleneck was the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
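For completeness, the kit::topn call used in STEP 5 looks like this in isolation; it returns the indices of the n largest (or smallest) values without fully ordering the vector:

x <- rnorm(1e5)
idx <- kit::topn(vec = x, n = 250, decreasing = TRUE, hasna = FALSE)  ## indices of the 250 largest values
head(x[idx])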
library(purrr)
library(stringr)
library(kit)
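## simulated data: 80,000 rows x 90 predictor columns (X1..X90) plus a value column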
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
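## size of each variable combination (m = 4 here; m = 5 is the eventual goal)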
combination <- 4
STEP 1:
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
STEP 2:
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
STEP 3:
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
STEP 4:
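## loop over the chunks; each pass computes the products, top-n indices, and corresponding values for up to 10,000 combinations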
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
STEP 4 (cont):
## using purrr::map
# p <- Sys.time()
ele_length <- length(chunks_run[[chunks_idx]])
ops_list_temp <- vector("list", length = ele_length)
ops_list_temp <- purrr::map(
ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
)
# (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
# ## using base lapply
# p <- Sys.time()
# ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
# ops_list_temp <- vector("list", length = ele_length)
# ops_list_temp <- lapply(
# ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
# )
# (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
## number of rows I want to subset from df
topn <- 250
## list to store indices of topn values for each list element
indices_list <- vector("list", length = length(ops_list_temp))
## list to store value of the topn indices for each list element
values_list <- vector("list", length = length(ops_list_temp))
STEP 5:
## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
## each element in this list should be the length of topn
indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
STEP 6:
## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
## each element in this list, should be the length of topn
values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
## save completed chunk to final list
chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins