
So I recently started analysing a much bigger dataset than I am used to and realised my code was not efficient enough in terms of speed. In order to speed up parts of my script, I decided to parallelise an lapply call.

This is my original call; it works but it is very slow:

list_vect <- lapply(X = df1$start, function(x){

    vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)

})

My first attempt at going parallel looked like this:

cl <- makeCluster(detectCores() - 2) # 6 out of 8 cores

list_vect <- parLapply(cl, X = df1$start, function(x){

    vect_of_num <- which(df2$start <= x + 500 & df2$start >= x - 500)

})

This produces an error telling me that df2 doesn't exist.
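(From what I have read, the textbook fix for that error is to export df2 to the workers before the call. Here is a minimal sketch of that route, assuming df2 lives in the global environment, though I ended up trying a different approach below:)

library(parallel)

cl <- makeCluster(detectCores() - 2)
clusterExport(cl, "df2") # copy df2 to each worker so the anonymous function can see it

list_vect <- parLapply(cl, X = df1$start, function(x){
    which(df2$start <= x + 500 & df2$start >= x - 500)
})

stopCluster(cl)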

Following advice, I instead defined the function outside the call:

get_list_vect <- function(var1, var2){

  vect_of_num <- which(var2$start <= var1 + 500 & var2$start >= var1 - 500)

}

cl <- makeCluster(detectCores() - 2) # 6 out of 8 cores

list_vect <- parLapply(cl = cl, df1$start, df2$start, get_list_vect)

This last piece of code does run, but I feel I did something wrong. When using lapply, the monitoring screen shows steady memory usage (around 8 GB). When calling parLapply, however, memory usage keeps increasing until it reaches the maximum of 32 GB and the system freezes.
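Rereading ?parLapply, I suspect the call I was actually aiming for passes df2 through the ... argument rather than positionally, so that it reaches get_list_vect as var2. A sketch of what I think I meant (untested on the full data):

library(parallel)

cl <- makeCluster(detectCores() - 2)

# pass df2 to the workers via parLapply's ... so it becomes the var2 argument of get_list_vect
list_vect <- parLapply(cl, X = df1$start, fun = get_list_vect, var2 = df2)

stopCluster(cl)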

I hope you can see where I went wrong. Feel free to suggest a better approach.

Paul Endymion
  • Please provide some example data to make a minimal reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – emilliman5 Aug 14 '19 at 14:19
  • As far as I understand R, your memory is ballooning because R copies the necessary objects for use on each core. So if `df2$start` is very large it is getting copied 6 times. – emilliman5 Aug 14 '19 at 14:23
  • Yes, I thought about that, but the table is not big enough to overload my memory by being copied 6 times, so I figured there was something in my approach that could be done better. – Paul Endymion Aug 14 '19 at 19:01

1 Answer


A data.table approach using non-equi joins:

require(data.table)
# create data.tables:
dt1 <- data.table(start = df1$start)
dt2 <- data.table(start = df2$start)
# calculate beginning and end values
dt2[, beg := start - 500L]
dt2[, end := start + 500L]
# add indexes
dt1[, i1 := .I]
dt2[, i2 := .I]

# join:
x <- dt2[dt1, on = .(end >= start, beg <= start), allow.cartesian = TRUE]
x[, .(.(i2)), keyby = i1]
#       i1                                V1
# 1:     1  788,1148,2511,3372,5365,8315,...
# 2:     2 2289,3551,4499,4678,4918,5008,...
# 3:     3     2319,3459,3834,5013,6537,9714
r <- x[, .(.(i2)), keyby = i1][[2]] # transform to list
all.equal(list_vect, r)
# TRUE
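The same overlap can probably also be expressed with foverlaps() instead of a non-equi join. A rough sketch reusing the dt1/dt2 built above (the contents of each list element should match, though the ordering inside each element may differ):

# treat each df1$start value as a zero-length interval and find the [beg, end] windows containing it
dt1[, `:=`(lo = start, hi = start)]
setkey(dt2, beg, end)  # foverlaps() requires a key on the interval table
ov <- foverlaps(dt1, dt2, by.x = c("lo", "hi"), type = "within")
r2 <- ov[, .(.(i2)), keyby = i1][[2]]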
minem