So I recently started analysing a much bigger dataset than I am used to and realised my code was not efficient enough in terms of speed.
In order to make parts of my script run faster, I decided to parallelise some of the lapply calls I make.
This is my original call; it works but it is very slow:
list_vect <- lapply(X = df1$start, function(x){
  vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)
})
My first attempt to go parallel was like this:
library(parallel)

cl <- makeCluster(detectCores() - 2) # 6 out of 8 cores
list_vect <- parLapply(cl, X = df1$start, function(x){
  vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)
})
This produces an error telling me that df2 doesn't exist.
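From what I understand, this happens because the worker processes do not inherit my global environment, so one fix I have seen suggested is to export df2 to the workers with clusterExport before calling parLapply. Something along these lines (I have not tested this thoroughly):

library(parallel)

cl <- makeCluster(detectCores() - 2)
# copy df2 into the global environment of every worker
clusterExport(cl, varlist = "df2")
list_vect <- parLapply(cl, X = df1$start, function(x){
  which(df2$start <= x + 500 & df2$start >= x - 500)
})
stopCluster(cl)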
Following advice, I instead defined the function outside the call:
get_list_vect <- function(var1, var2){
  vect_of_num <- which(var2$start <= var1 + 500 & var2$start >= var1 - 500)
}
cl <- makeCluster(detectCores() - 2) # 6 out of 8 cores
list_vect <- parLapply(cl = cl, df1$start, df2$start, get_list_vect)
This last piece of code does run, but I feel I did something wrong. When using lapply, I can see on the monitoring screen that the memory usage stays steady (around 8 GB). However, when calling parLapply, I see the memory usage increase until it reaches the maximum of 32 GB and the system freezes.
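Re-reading ?parLapply, my guess is that the extra argument is supposed to be passed through ... and that the whole df2 should go in as var2 (since the function does var2$start), so the call should perhaps look like the sketch below. I am not sure, though, whether that is exactly what is blowing up the memory, since df2 would then be copied to every worker:

library(parallel)

cl <- makeCluster(detectCores() - 2)
# each element of df1$start becomes var1; df2 is passed to every call as var2
list_vect <- parLapply(cl, df1$start, get_list_vect, var2 = df2)
stopCluster(cl)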
I hope you guys will see where I went wrong. Feel free to suggest a better approach.