
I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward. I'm running R version 4.0.3 on Windows.

  1. With combn, create a vector of unique combinations (m = 3, 4, or 5) of the df variable names (i.e., var1*var2*var3 ... var1*var2*varN)
  2. Transform the vector of combinations into a list of formulas
  3. Split the list of formulas into chunks to get around memory limitations (required to run step 4)
  4. Iterate through each chunk from step 3 and perform the formula operation on the df. Save the values resulting from the operations in a separate list (ops_list_temp) to use in step 5
  5. For each element in ops_list_temp, find the indices of the largest n values based on the user-specified topn and save the results to indices_list
  6. For each element in indices_list, subset the df by the indices in that element and store the corresponding values in values_list

The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use `:=` from data.table following the link below, but I was unable to figure out how to transform the list of formulas into expressions that could be fed to `quote(:=(...))`:

Apply a list of formulas to R data.table
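For what it's worth, one way to build such an expression programmatically (a sketch with toy data, shown here with a plain data.frame; the same `expr` could in principle be handed to data.table as `dt[, new_col := eval(expr)]`, though I haven't benchmarked it):

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
combo <- c("a", "b", "c")  # one combination of column names

# Build the call a * b * c from the character vector of names...
expr <- Reduce(function(x, y) call("*", x, y), lapply(combo, as.name))

# ...and evaluate it with the data frame's columns in scope
eval(expr, envir = df)
```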

It appears to me that one of the bottlenecks in my code is the variable operation step (STEP 4). With m=4 and 90 variables, there are a total of 2,555,190 combinations (`RcppAlgos::comboCount(v = 90, m = 4, repetition = FALSE)`). Breaking this up into chunks of 10,000 to get around memory limitations results in a list of 256 elements.

With m=5, there are 43,949,268 combinations (`RcppAlgos::comboCount(v = 90, m = 5, repetition = FALSE)`) and a chunks list of ~4,395 elements.

A previous bottleneck was the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses a combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to a combn of 5 if the speed is reasonable.

Fastest way to find second (third...) highest/lowest value in vector or column

library(purrr)
library(stringr)
library(kit)

df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4

## STEP 1:
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")

## STEP 2:
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))

## STEP 3:
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))

## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))

## STEP 4:
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
  ## STEP 4 (cont):
  ## using purrr::map
  # p <- Sys.time()
  ele_length <- length(chunks_run[[chunks_idx]])
  ops_list_temp <- vector("list", length = ele_length)
  ops_list_temp <- purrr::map(
    ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  )
  # (p <- Sys.time()-p)  #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
  
  # ## using base lapply
  # p <- Sys.time()
  # ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
  # ops_list_temp <- vector("list", length = ele_length)
  # ops_list_temp <- lapply(
  #   ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
  # )
  # (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
  
  ## number of rows I want to subset from df
  topn <- 250
  
  ## list to store indices of topn values for each list element
  indices_list <- vector("list", length = length(ops_list_temp))
  
  ## list to store value of the topn indices for each list element
  values_list <- vector("list", length = length(ops_list_temp))
  
  ## STEP 5:
  ## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
  ## each element in this list should be the length of topn
  indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
  
  ## STEP 6:
  ## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
  ## each element in this list, should be the length of topn
  values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
  
  ## save completed chunk to final list
  chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
On_an_island
    One opportunity for improvement: if you have all combos of length 4, then you have `abcd`, `abce`, `abcf` and can precompute the `abc` part. You'd have to figure out how best to do that in R, though (eg, a loop to get all pairs; then all trios; then the final values). I don't see a shortcut for topn – Frank Feb 01 '23 at 05:47
    Is the example provided exactly the steps you need to execute in your real world problem, or more of a contrived "toy" example to make this more approachable? It seems like you could take advantage of treating each row as a vector, sorting it, and excluding all combinations except for a subset of largest and for smallest (most negative) values. – Matt Summersgill Mar 09 '23 at 21:06
    Also, does your real data have negative values? – Matt Summersgill Mar 09 '23 at 21:17
  • @MattSummersgill the example from step 1:6 is the code I execute on my real world problem, only the `df` is an example. There are a few more steps after 6 but the majority of the overhead as far as computing time is in these steps (that's why I have omitted the later steps from my question). The majority of the columns in my real data frame have positive values (>=0) with the exception of maybe 4 or 5 columns. I'm not sure I follow your suggestion 100% about row vectors (that's a me problem though as I'm more of a visual person). If you have a suggested code I'll gladly give it a try. Thanks! – On_an_island Mar 10 '23 at 02:15
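
Frank's precomputation idea can be sketched with toy data (hypothetical column names, not code from the thread): compute each shared pairwise prefix once, then reuse it for every longer combination that starts with it.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Compute the shared prefix a*b once...
pair_ab <- df$a * df$b

# ...then reuse it for every length-3 combination beginning with a, b
prod_abc <- pair_ab * df$c
prod_abd <- pair_ab * df$d
```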

1 Answer

  1. When you are memory constrained, you need to avoid intermediate assignments of large objects.
  2. In this case there is no reason to use names instead of integer indices to iterate over.
  3. In step 1 you are paste()ing your values only to split them again in step 2, why?
  4. When you need more speed, parallelization can be a way to go. Your problem is highly parallelizable, but it also increases memory usage, so your mileage may vary.

In the piece of code below, I took your problem and applied these lessons. On my fairly old i5-6267U dual-core processor, it took 8 seconds to run the parallelized future_map() on 10,000 observations. That is equivalent to one iteration of your for loop, which takes 46 seconds on my machine, so this yields an approximate speedup of 6x. Since your loop was not parallelized, you may see an even greater increase if you have a more modern processor with more cores. The preparatory steps before the loop are also much faster.

library(stringr)
library(kit)
library(furrr)

plan(multisession)

# Parameters
tpn <- 250 # set topn n parameter
combination <- 4

# Data
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)

# Calculations
cols <- which(names(df) != "value") # indices for all columns but `value`
cbn <- combn(cols, combination, simplify = F) # combinations


result <- cbn |> 
  future_map(\(cb) df[, cb] |> # select the respective columns 
          Reduce(f = `*`) |> # rowwise product
          kit::topn(tpn) |>
          (\(x) df[x, "value"])() # select corresponding values
        )
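
As a side note, Reduce(f = `*`) yields the rowwise product because a data frame is a list of column vectors, which Reduce multiplies elementwise one after another:

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Equivalent to d$a * d$b * d$c
Reduce(`*`, d)
```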

Edit

On R 4.0 you can use this:

library(purrr)
result <- cbn %>% 
  future_map(function(cb) df[, cb] %>% 
               Reduce(f = `*`) %>%
               kit::topn(tpn, hasna = F) %>%
               `[`(df, ., "value")
             )

If the multithreading still does not work, replace future_map() with map() to run it sequentially. Also, when you are testing, you may want to restrict the data to a subset like cbn[seq_len(1e4)] so that you don't have to wait around for the whole thing to finish.
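
And if even this runs out of memory (e.g. with m = 5), the same pipeline can be processed chunk-by-chunk so that only one chunk's results live in memory at a time. A minimal base-R sketch of the chunking pattern, with a tiny combination list and sum() standing in for the real per-combination work:

```r
cbn <- combn(10, 3, simplify = FALSE)  # small stand-in for the real combination list

# Split the combination indices into chunks of 25
chunk_ids <- split(seq_along(cbn), ceiling(seq_along(cbn) / 25))

# Process one chunk at a time; swap lapply() for future_map() to parallelize
result <- unlist(
  lapply(chunk_ids, function(idx) lapply(cbn[idx], sum)),
  recursive = FALSE
)
```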

shs
  • looks like you're using a newer version of R? My R version is 4.0.3. My apologies. I should've noted that in my question. – On_an_island Mar 08 '23 at 23:22
  • I see now that `\()` is the syntax for `function()` – On_an_island Mar 09 '23 at 00:45
  • your code ran for over 1.5 hours and didn't finish, so it appears to be slower on my laptop than what you're getting on your end. While I see several `R for Windows front-end` sessions running, it appears that only one session is processing. I'm assuming this means your code is running sequentially on my laptop as opposed to in parallel. – On_an_island Mar 09 '23 at 02:30
    Support for futures works differently on different OS's. I don't have a windows machine on hand right now. But on my device it is still much faster to run my code sequentially than to run yours (~8s multisession, ~14s sequential, ~46s yours). The only explanation I have is that the improperly running multisession creates a lot of overhead and that gives you these poor results. You could try to replace `future_map()` with `purrr::map()` and see where sequential execution takes you. Alternatively, you could also achieve this by switching the evaluation strategy with `plan(sequential)`. – shs Mar 09 '23 at 09:40
    And did you update your R version or did you replace the R 4.1 syntax to run on your older version? If you changed the code, please post it here. A change in the code may also contribute to the slower performance. – shs Mar 09 '23 at 09:43
  • i did not update to R 4.1 but I changed the syntax to the following: `result <- cbn %>% future_map(~function(cb) df[, cb] %>% Reduce(f = `*`) %>% kit::topn(tpn, hasna = F) %>% (~function(x) df[x, "value"])())`; you'll see I included the `hasna` argument of `kit::topn` as that speeds up the `kit` part of the code as noted in the link I included in my question. – On_an_island Mar 10 '23 at 02:06
    And this code ran for you? I get an immediate syntax error because of the missing backticks in `Reduce(f = *)`. Also, in your attempt to convert the R 4.1 shorthand functions to something that works on your machine, you created some weird formula objects. You can use either purrr anonymous functions or base anonymous functions, not both. What you have does not work and probably caused your issues with my solution. I'll edit a solution for R 4.0 into my answer – shs Mar 10 '23 at 13:28
  • Yes the code ran for me. It seems when you add a comment on here with backticks it removes them. I'll write the reduce line again to see if it will remove the backticks: `Reduce(f = `*`)` the code I ran had the backticks. I'll give your updated code a run later today when I'm on my laptop. Thanks! Edit: yes, the backticks were removed even though I just copied and pasted your line. – On_an_island Mar 10 '23 at 15:03
    my bad. Inline code segments in markdown are delimited by backticks, that is why they were removed – shs Mar 10 '23 at 15:29
    Did you test the code I edited into my answer? Even though the backticks were a red herring, my point about the formula objects still stands. I think it will be much faster than your original code – shs Mar 12 '23 at 17:48
  • I ran it today and I'm getting significant speed gains! using m=4 & `furrr::future_map` runtime went from ~40 min to ~18 min. The nice thing is that I can apply the same parallelization to other parts of my code as they follow the same structure. For `m=5` resulting in 40+ million combos I'll still have to break it up into chunks because it eventually hits a memory error. As a side note, I used the `~function()` syntax because the `furrr` documentation use it in the examples but apparently that was creating formula objects instead of executing the function? – On_an_island Mar 12 '23 at 21:55
    The `tidyverse` has its own form of anonymous functions based on formulas. In functions compatible with it, `function (x) sum(x)` is the same as `~ sum(.x)`. The issue wasn't that you were trying to use them, but that you were using them incorrectly. – shs Mar 13 '23 at 08:32
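
The equivalence shs describes can be illustrated with a small example (using purrr, as in the thread):

```r
library(purrr)

v <- c(1, 2, 3)
map_dbl(v, function(x) x^2)  # base anonymous function
map_dbl(v, ~ .x^2)           # purrr formula shorthand; same result
```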