
I have a big data.table. Each parallel process reads from it, processes the data, and returns a much smaller data.table. I don't want the big DT to be copied to all processes, but it seems the %dopar% function in the foreach package has to copy it.

Is there a way to have the object shared across all processes (on Windows)? That is, by using a package other than foreach.

Example code

library(doParallel)
library(data.table)  # needed on the master for data.table() below

cluster = makeCluster(4)
registerDoParallel(cluster)

M = 1e4 # make this larger
dt = data.table(x = rep(LETTERS, M), y = rnorm(26*M))
res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind,
              .packages = "data.table") %dopar% {  # workers also need data.table loaded
  dt[, .(trimmean = mean(y, trim = trim)), by = x][, trim := trim]
}

(I'm not interested in a better way of doing this in data.table without using parallel. This is just to show the case where the subprocesses need to read all of the data to do their work, but never change it.)

Matt Dowle
jf328
  • http://stackoverflow.com/questions/31575585/shared-memory-in-parallel-foreach-in-r – admccurdy Mar 03 '16 at 17:18
  • That's where I got the information that `foreach` has to copy. I'm seeking other possibilities – jf328 Mar 04 '16 at 08:54
  • I typically use snow for parallel coding and haven't run into a problem with memory, so if I'm confused here let me know. In your code, dt is getting altered in each iteration of foreach, so, as the link I posted says, it needs to be copied, altered, and then returned. It sounds like if you are assigning the results of an operation to another object, it won't be copied by every process but only read. Now I'm not sure how that would work with data.table's behaviour of changing structures in place... maybe try the same task with dplyr and assign to a different object to see if there is a difference. – admccurdy Mar 04 '16 at 15:39
  • @AdamMccurdy, dt is not changed in subprocess, it is only read from. The first [] returns a new data.table and then the new one is modified in the second []. – jf328 Mar 04 '16 at 19:27

1 Answer


Since R isn't multithreaded, parallel workers are implemented as processes in the various parallel programming packages. One of the features of processes is that their memory is protected from other processes, so programs have to use special mechanisms to share memory between processes, such as memory-mapped files. Since R doesn't have direct, built-in support for any such mechanism, packages such as "bigmemory" have been written that let you create objects that can be shared between different processes. Unfortunately, the "data.table" package doesn't support such a mechanism, so I don't think there is a way to do what you want.
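For illustration, here is a minimal sketch of what the "bigmemory" route could look like for your example. Two assumptions I'm making: a big.matrix holds a single atomic type, so the grouping letters have to be encoded as integers, and each worker still builds a small local data.table from the shared data before using data.table syntax.

library(doParallel)
library(bigmemory)
library(data.table)

cluster = makeCluster(4)
registerDoParallel(cluster)

M = 1e4
# encode the grouping variable as an integer column, since a
# big.matrix holds a single atomic type in shared memory
bm = as.big.matrix(cbind(g = rep(1:26, M), y = rnorm(26*M)))
desc = describe(bm)   # small descriptor object; this is all that gets exported

res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind,
              .packages = c("bigmemory", "data.table")) %dopar% {
  m = attach.big.matrix(desc)   # attach to the shared segment, no copy of bm
  dt = data.table(x = LETTERS[m[, "g"]], y = m[, "y"])   # local copy made here
  dt[, .(trimmean = mean(y, trim = trim)), by = x][, trim := trim]
}

stopCluster(cluster)

If attaching a purely in-memory big.matrix across processes doesn't work on your setup, a file-backed big.matrix (created with the backingfile and descriptorfile arguments) can be attached the same way.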

Note that memory can be shared read-only between a process and a forked child process on POSIX operating systems (such as Mac OS X and Linux), so you could sort of do what you want using the "doMC" backend, but that doesn't work on Windows, of course.
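For example, here is a minimal sketch of the fork-based version, which is just your example with the backend swapped (it only runs on Mac OS X/Linux):

library(doMC)         # fork-based backend; POSIX only, not available on Windows
library(data.table)
registerDoMC(4)

M = 1e4
dt = data.table(x = rep(LETTERS, M), y = rnorm(26*M))

# forked workers see the parent's dt through copy-on-write pages;
# as long as they only read it, the big table is never physically copied
res = foreach(trim = seq(0.6, 0.95, 0.05), .combine = rbind) %dopar% {
  dt[, .(trimmean = mean(y, trim = trim)), by = x][, trim := trim]
}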

Steve Weston