
I am working on a task that applies a function to a long list of tibbles, roughly 30,000 elements. The code I'm using is as follows:

library(dplyr)
library(furrr)
library(future)   # plan() and multisession
library(rlang)    # empty_env()

plan(multisession, workers = 20)

hpar$input_df %>%
    group_by(key) %>%
    group_split() %>%                       # one tibble per key, ~30,000 in total
    future_walk(sf_do_all_one_series,
                hpar$df_stockout_decaying,
                hpar$out_path,
                hpar$out_prefix,
                .env_globals = empty_env())
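
For context, the list handed to future_walk can be inspected directly (same objects as above; the name groups is just for illustration):

groups <- hpar$input_df %>%
    group_by(key) %>%
    group_split()

length(groups)    # ~30,000 tibbles, one per key
groups[[1]]       # a single series, the unit of work for sf_do_all_one_series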

The function sf_do_all_one_series, which future_walk applies to each split tibble, has the following structure:

sf_do_all_one_series <- function(df_1, df_stockout_decaying, out_path, out_prefix){
  
  # each split tibble carries a single key; recover it for the output file name
  key <- df_1 %>% distinct(key) %>% pull()
  
  do_stuff(df_1, df_stockout_decaying) %>% 
    do_more_stuff() %>% 
    write.table(file=paste0(out_path, out_prefix, "_", key, ".csv"),
                quote=FALSE, sep='\t', row.names=FALSE)
  
  # nothing is returned to the calling process
  invisible()
}

The hpar$input_df tibble has roughly 3 million rows and is about 202 MB according to object.size(.). hpar$df_stockout_decaying is a small tibble of constant values, and hpar$out_path and hpar$out_prefix are character strings.
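
For reference, those sizes were checked along these lines:

format(object.size(hpar$input_df), units = "Mb")             # ~202 Mb, ~3 million rows
format(object.size(hpar$df_stockout_decaying), units = "Kb") # small tibble of constants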

The issue I'm encountering is that memory usage grows dramatically over time while this runs, as if some intermediate output were being retained. I'm looking for help understanding what causes this growth and how to avoid it.

I have made several attempts to address the memory issue but haven't been successful so far. Here are the steps I've taken:

  1. Removing intermediate objects: I tried minimizing the use of intermediate objects within the code to reduce memory consumption.

  2. Garbage collector: I also explicitly called the garbage collector using the gc() function to free up any unused memory.

  3. .env_globals = empty_env(): Although I don't believe this argument is necessary for future_walk, I still tried including it in the hope that it might have an impact.

Unfortunately, despite implementing these measures (a sketch of how the gc() call fits into the worker is below), memory usage has not shown any noticeable improvement.
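
A minimal sketch of point 2, with gc() called just before the worker returns (the exact placement in my code may have differed slightly; point 3 is the .env_globals = empty_env() argument already shown in the future_walk call above):

sf_do_all_one_series <- function(df_1, df_stockout_decaying, out_path, out_prefix){
  key <- df_1 %>% distinct(key) %>% pull()
  
  # ... same processing and write.table() call as shown above ...
  
  gc()          # explicitly free unused memory before the worker returns
  invisible()
}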

I would greatly appreciate any suggestions or insights to help resolve this issue.

  • I can't articulate exactly why, but I suspect that `write.table` may be the issue. Try [fwrite](https://www.rdocumentation.org/packages/data.table/versions/1.14.8/topics/fwrite) – pgcudahy May 24 '23 at 10:32
  • Hi, thanks for the answer. Actually I do not think the problem is the file writing; even without writing anything I was experiencing the same issue. `mclapply` seems to be less memory-greedy – alexon Jun 08 '23 at 07:38

0 Answers