
I run a loop over the elements of the list `grouped_data_list` using `foreach` and `%dopar%`.

The runtime is terribly slow, even though the workers are visibly busy.

If I write the same routine serially with `lapply`, with no parallelism at all, it takes seconds. What is wrong with my `%dopar%`?

library(data.table)
library(doParallel) # parallel backend for foreach
library(foreach)    # parallel looping

grouped_data_dt <- data.table(
Who=c("thdeg","mjg","dfdf","system","df","system","system","hegha","ydvw")
, DocumentExtension=c("jpg","com","dug","182","27","pdf","png","xslt","53")
, What_Action=c("added","removed","added","added","added","removed","added","added","added")
, Date=as.Date(c("2017-11-08","2017-10-10","2017-09-14","2017-09-20","2017-09-21","2017-10-20","2017-10-19","2017-08-24","2017-09-17"))
, Count=c(1,2,3,4,5,6,7,8,9)
)

reported_date_seq_dt <- data.table(
reported_date_seq = as.Date(c(
"2017-08-23","2017-08-24","2017-08-25","2017-08-26","2017-08-27","2017-08-28","2017-08-29","2017-08-30","2017-08-31","2017-09-01","2017-09-02"
,"2017-09-03","2017-09-04","2017-09-05","2017-09-06","2017-09-07","2017-09-08","2017-09-09","2017-09-10","2017-09-11","2017-09-12","2017-09-13"
,"2017-09-14","2017-09-15","2017-09-16","2017-09-17","2017-09-18","2017-09-19","2017-09-20","2017-09-21","2017-09-22","2017-09-23","2017-09-24"
,"2017-09-25","2017-09-26","2017-09-27","2017-09-28","2017-09-29","2017-09-30","2017-10-01","2017-10-02","2017-10-03","2017-10-04","2017-10-05"
,"2017-10-06","2017-10-07","2017-10-08","2017-10-09","2017-10-10","2017-10-11","2017-10-12","2017-10-13","2017-10-14","2017-10-15","2017-10-16"
,"2017-10-17","2017-10-18","2017-10-19","2017-10-20","2017-10-21","2017-10-22","2017-10-23","2017-10-24","2017-10-25","2017-10-26","2017-10-27"
,"2017-10-28","2017-10-29","2017-10-30","2017-10-31","2017-11-01","2017-11-02","2017-11-03","2017-11-04","2017-11-05","2017-11-06","2017-11-07"
,"2017-11-08","2017-11-09","2017-11-10","2017-11-11","2017-11-12","2017-11-13","2017-11-14","2017-11-15","2017-11-16","2017-11-17","2017-11-18"
,"2017-11-19","2017-11-20","2017-11-21","2017-11-22","2017-11-23","2017-11-24","2017-11-25","2017-11-26","2017-11-27"
))
     )

grouped_data_list <- split(
          x = grouped_data_dt
          , by = c("Who", "DocumentExtension", "What_Action")
          , drop = TRUE
          , sorted = TRUE
          , keep.by = TRUE
)


 cl <- makeCluster(4)
 registerDoParallel(cl)


 ## replace NA with zeros in the timeseries

 grouped_data_list_2 <- list()

 foreach(
      i = 1:length(grouped_data_list)
         ) %dopar%
 {

      x <- grouped_data_list[[i]]

      data.table::setkey(x, Date)

      dt_params <- unlist(
           x[1, -c('Date', 'Count'), with = F]
           )

      y <- x[reported_date_seq_dt]

      y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]

      y[is.na(Count), Count := 0]

      grouped_data_list_2 <- c(grouped_data_list_2
                               , list(y)
      )
 }

 stopCluster(cl)
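As it turned out (see the comments and the fixed version below), two things are wrong here. First, `grouped_data_list_2 <- c(...)` inside `%dopar%` only modifies a copy of the list on each worker; `foreach` behaves like `lapply`, collecting the value of the last expression of the body, not side effects. Second, because the body indexes `grouped_data_list[[i]]`, the whole list gets exported to every worker. A minimal sketch of the first point (toy variable names, not from my actual code):

     cl <- makeCluster(2)
     registerDoParallel(cl)

     res <- list()
     out <- foreach(i = 1:3) %dopar% {
          res <- c(res, list(i))  # modifies a worker-side copy; lost on return
          i * 10                  # the last expression is what foreach collects
     }

     stopCluster(cl)

     length(res)  # 0: the worker-side assignments never reach the master
     out          # list(10, 20, 30): foreach itself returned the results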

The `lapply` routine:

## after grouped_data_list is created

 rm(group_replace_func)

 group_replace_func <- function(x)
 {

      setkey(x, Date)

      dt_params <- unlist(
      x[1, -c('Date', 'Count'), with = F]
      )

      y <- x[reported_date_seq_dt]

      y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]

      y[is.na(Count), Count := 0]

      return(y)

 }

 grouped_data_list_2 <- lapply(
                          grouped_data_list
                          , group_replace_func
                          )
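For scale, the serial routine can be timed directly (a sketch; absolute numbers will of course differ by machine and data size):

     system.time(
          grouped_data_list_2 <- lapply(grouped_data_list, group_replace_func)
     )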

A new version that works fast (@Roland's advice):

## parallel work

     cl <- makeCluster(4)
     registerDoParallel(cl)


     ## replace NA with zeros in the timeseries

     grouped_data_list_2 <- list()

     grouped_data_list_2 <- foreach(
          x = grouped_data_list
             ) %dopar%
     {

          data.table::setkey(x, Date)

          dt_params <- unlist(
               x[1, -c('Date', 'Count'), with = F]
          )

          y <- x[reported_date_seq_dt]

          y[is.na(Count), (colnames(y)[!colnames(y) %in% c('Date', 'Count')]) := lapply(1:length(dt_params), function(x) dt_params[x])]

          y[is.na(Count), Count := 0]

          y

     }

     stopCluster(cl)
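For completeness, the same fix can also be written with `parallel::parLapply`, reusing `group_replace_func` from above. This is a sketch, not tested code from the thread; the `clusterEvalQ`/`clusterExport` calls are assumed to be needed because the function calls `setkey` and references `reported_date_seq_dt`:

     library(parallel)

     cl <- makeCluster(4)
     clusterEvalQ(cl, library(data.table))      # attach data.table on each worker
     clusterExport(cl, "reported_date_seq_dt")  # ship the date lookup table once
     grouped_data_list_2 <- parLapply(cl, grouped_data_list, group_replace_func)
     stopCluster(cl)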
  • Questions of efficiency should really be accompanied by a fully reproducible example – MichaelChirico Jan 15 '18 at 14:13
  • Could you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Emmanuel-Lin Jan 15 '18 at 14:15
  • I have added a small reproducible example (I actually run it on hundreds of times more data). – Alexey Burnakov Jan 15 '18 at 14:36
  • Never split a data.table. That should absolutely not be necessary and will reduce efficiency. Do you know that data.table itself is partly parallelized? Anyway, I believe your code copies `grouped_data_list` for each operation. Why don't you do `foreach(x = grouped_data_list)`? – Roland Jan 15 '18 at 15:32 (a no-split sketch follows after this thread)
  • Also, you can't grow an object in a parallel `foreach` loop. `foreach` is more similar to `lapply` than to `for`. The expression after `%dopar%` should look exactly like the body of `group_replace_func` (with `y` instead of `return(y)`). – Roland Jan 15 '18 at 15:34
  • @Roland, "Do you know that data.table itself is partly parallelized?" No. Is that realized using BY arg? – Alexey Burnakov Jan 15 '18 at 15:43
  • @Roland, "Anyway, I believe your code copies grouped_data_list for each operation. Why don't you do foreach(x = grouped_data_list)". I did not know how to do it (I don't use dopar as a normal way). – Alexey Burnakov Jan 15 '18 at 15:45
  • @Roland, I added a new version of script. It really works fast now! Thank you! – Alexey Burnakov Jan 15 '18 at 15:53
  • @Roland, **"Never split a data.table. That should absolutely not be necessary and will reduce efficiency. Do you know that data.table itself is partly parallelized?"** I tried to do all work inside DT[, ....., by]. It resulted in huge memory leak and inability of R to debug inside the loop. I then chose to split DT and do lapply. – Alexey Burnakov Jan 15 '18 at 16:06
  • Please don't use phrases like "memory leak" unless you can provide proof. If you can, you should submit a bug report. Anyway, I can't comment on code I didn't see and test. – Roland Jan 16 '18 at 06:51
  • "Please don't use phrases like "memory leak" unless you can provide proof. If you can, you should submit a bug report." In fact I can. Don't take it as the data.table insult, for I think it is my fault that the code became leaky. – Alexey Burnakov Jan 16 '18 at 10:02
  • @Roland, I would use constructions like the following: result_dt <- data.table(); dt[filter, **x <- copy(.SD)**; transform x; **result_dt <<- rbind(result_dt, x)**, by = conditions]. I was told it was bad practice, referring to the copy and the complex assignment, but it was the way I used anyway. Memory consumption was 18 Gb more than the volume of objects in my environment. That was the leak. – Alexey Burnakov Jan 16 '18 at 10:30
  • "Bad practice" is an understatement. More like deliberately trying to break it or make it inefficient. If you plan on using data.table (or any other tool) you'd benefit from studying the documentation first. There are excellent [vignettes for data.table](https://github.com/Rdatatable/data.table/wiki/Getting-started). Memory consumption is not a memory leak. – Roland Jan 16 '18 at 11:07
  • @Roland, "More like deliberately trying to break it or make it inefficient. " it is laughable. Why would I want to break my own Work? When memory consumption exceeds object size at the end of dt routine, how do you call it? Thank you for the reference. – Alexey Burnakov Jan 16 '18 at 11:36
  • 1. You managed to write the worst code using data.table I've ever seen. In fact, you carefully circumvent everything that makes it so efficient. 2. Memory consumption must exceed object size if you do anything with that object. A [memory leak](https://en.wikipedia.org/wiki/Memory_leak) means that memory that isn't needed is not released. – Roland Jan 16 '18 at 11:43
  • @Roland, 1) got it already, though I am not sure you have understood why exactly it was done this way. 2) operations that cause no trouble inside R's base loops do cause trouble inside the dt[...] environment (complex assignment is one of them, copying .SD is probably the other), while it is normal practice to use <<- with functions, and to copy objects when delete/transform operations should not affect the original. As a result, what happens inside dt[..] (albeit written "terribly" in my case) is not managed by either the debugger or - probably - gc(). – Alexey Burnakov Jan 16 '18 at 12:53
  • `data.table` is **not** a loop. Furthermore, you are making unsubstantiated claims, which are even less believable because your code shows a fundamental lack of understanding. Anyway, growing objects in loops is known to potentially cause performance and memory problems because it places a strong burden on the OS's memory management, resulting in fragmentation of RAM. As of a few R versions ago, the JIT byte-code compiler optimizes this away under specific circumstances, and R now has some limited support for over-allocating vectors, but these don't work inside data.table. – Roland Jan 16 '18 at 13:02
  • And of course, it is **not** normal practice to use `<<-` except for some specific purposes involving closures. Your code is not an example justifying use of `<<-`. In fact, I suggest you forget that this operator exists. You would write better code as a result. – Roland Jan 16 '18 at 13:04
  • @Roland, let me create an issue on the data.table GitHub (with my reproducible example) and let's just see what answer I get from the devs. On the **loop**: using `by` is a loop, isn't it? On **`<<-`**: I used it inside nested functions where I had to grow an object in a nested loop to be returned to the GlobalEnv at the end of a master function. – Alexey Burnakov Jan 16 '18 at 13:38
  • Reading your code hurts. Don't expect enthusiasm from the data.table devs. "I had to grow an object in a nested loop to be returned to GlobalEnv" is a false statement. You don't have to do that. And you shouldn't do that. And no, using `by` is not really a loop. You can use it as a loop, but you shouldn't. – Roland Jan 16 '18 at 13:45
  • And to make it clear: Your code so far has not demonstrated a bug in data.table. If you use a screwdriver to put a nail into a wall and are unsuccessful, the screwdriver isn't broken. – Roland Jan 16 '18 at 13:49
  • @Roland, let's just accept your initial answer, which did help me. On the other points you have made: there are a lot of unnecessary epigrams and allusions in this thread related to my code which do not contain any direction and/or technical details (besides the link to the vignettes). When I reproduce the memory behavior that worried me and get some answer from the team, we can return to the point. Have a good day. – Alexey Burnakov Jan 16 '18 at 18:38
  • If you explained your actual problem, better solutions could be suggested. I'm confident that this could be done faster by orders of magnitude. – Roland Jan 16 '18 at 18:40
  • @Roland, I have created another thread concerning the priors of my issue (including what we had discussed): https://stackoverflow.com/questions/47639739/r-memory-profiling-shows-much-lower-memory-usage-than-windows-resource-monitor-d However, it is not possible to easily create a reproducible example, so I have left it without actual data. – Alexey Burnakov Jan 17 '18 at 12:42
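Regarding Roland's "never split a data.table" advice above, a no-split version of the same NA-filling task could look like the sketch below (my reconstruction, not code from the thread): build the full group-by-date grid in one pass, left-join the observed counts onto it, and zero-fill the gaps.

     ## one row per observed (Who, DocumentExtension, What_Action) group
     ## and reporting date, without splitting the table
     full_grid <- grouped_data_dt[
          , .(Date = reported_date_seq_dt$reported_date_seq)
          , by = .(Who, DocumentExtension, What_Action)
     ]

     ## left-join the observed counts onto the grid and zero-fill the NAs
     result_dt <- grouped_data_dt[full_grid
                                  , on = .(Who, DocumentExtension, What_Action, Date)]
     result_dt[is.na(Count), Count := 0]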
