I ran into something confusing when using multiprocessing to modify values in an R data.table.

I tried to modify values in place from inside a function. With one core it works, and the values in the data.table are changed. With multiple cores, the values in the data.table are left unchanged.

Does anyone know why?

library(data.table)
library(parallel)
aa <- as.data.table(iris)
aa[,tt:=0]
# modify aa$tt in place
main <- function(x){
  #set(aa,x,6L,5)
  aa[x,tt:=5]
  return(NULL)
}

# aa$tt changed
mclapply(1:nrow(aa), main, mc.cores = 1)

# aa$tt unchanged
mclapply(1:nrow(aa), main, mc.cores = 2)

Even Guan
  • Think about what actually happens when code is run in parallel. If you understand that, it will be obvious why modification in place can't work in parallel. – Roland Sep 27 '19 at 06:29
  • You may want to read both the documentation for the `mclapply` function (https://www.rdocumentation.org/packages/parallel/versions/3.3.2/topics/mclapply) and `data.table`'s `setDTthreads` (https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/setDTthreads). In any case, your example `aa[, tt := 5]` will be extremely fast with `data.table` already, even over large data sets (see https://stackoverflow.com/questions/19082794/speed-up-data-table-group-by-using-multiple-cores-and-parallel-programming). – cddt Sep 27 '19 at 06:58
  • I used the example above only to show the different results between one core and multiple cores. What I am after is the final data set. `aa[, tt := 5]` would definitely be enough for the case above. – Even Guan Sep 27 '19 at 07:14

1 Answer

Short answer: the parallel subprocesses work on copies of aa.

Longer answer:

mclapply uses forked subprocesses (essentially copies* of the parent process), so each one works on its own copy of the data (aa in your case).

This means in-place changes to aa in a subprocess do not modify aa in the main process.
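
To make this visible, here is a minimal sketch (assuming a Unix-like system, since forking with mc.cores > 1 is not available on Windows). The name seen_in_child is only illustrative. Each child modifies its own copy of aa and reports back what it sees, while aa in the parent stays as it was:

```r
library(data.table)
library(parallel)

aa <- as.data.table(iris)
aa[, tt := 0]

# Each forked child updates its own copy of aa in place and
# returns the value it then sees for that row.
seen_in_child <- unlist(mclapply(1:2, function(i) {
  aa[i, tt := 5]
  aa[i, tt]
}, mc.cores = 2))

seen_in_child  # 5 5: the change happened inside the children
aa[1:2, tt]    # 0 0: the parent's aa was never touched
```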

See ?parallel::mclapply for details, e.g. how to use the final result, which is the return value (!).

*) In fact, on Linux forking is implemented with copy-on-write memory pages to improve performance, so data is only physically copied once a process writes to it.
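
As a rough sketch of the return-value approach: let each worker compute and return its value, then assign the collected results once in the parent process. The names compute_tt and n_cores are hypothetical, and the constant 5 just stands in for whatever per-row work you actually need:

```r
library(data.table)
library(parallel)

aa <- as.data.table(iris)
aa[, tt := 0]

# Hypothetical worker: returns a value instead of modifying aa.
compute_tt <- function(i) 5

# Forking is not available on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(seq_len(nrow(aa)), compute_tt, mc.cores = n_cores)

# Assign in the parent process, where the in-place update is visible.
aa[, tt := unlist(res)]
```

mclapply returns its results in the same order as the input, so unlist(res) lines up with the rows of aa.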

R Yoda
  • Thanks for your reply. Does that mean the data will not be copied if I use one core? – Even Guan Sep 27 '19 at 07:07
  • Yes: no forking, no copying (by the OS). What R itself does is another matter (copy-on-write semantics), but since you are using `data.table`, `:=` does NOT copy; it overwrites in place. So this is the best performance you can get. With the new function `setDTthreads` (credit to @cddt for mentioning it here!) you can improve performance even further. See @arun's excellent update on new `data.table` features and performance benchmarks at the useR! 2019 conference: http://www.user2019.fr/static/pres/t258038.pdf – R Yoda Sep 27 '19 at 07:47