
I would like to understand the "doParallel" package better. I'm playing around with environments. I would like to create a global environment, `.someEnv <- new.env(parent = emptyenv())`, outside of `foreach() %dopar% {…}`.

I know that foreach() needs all packages, functions, values and data imported into the foreach() call using `.export = ""` and `.packages = ""`.

My question is: is there a way to import a global environment into foreach(), read from and write to this environment, and use it as a way to cache calculations? (Please no comments on caching calculations using .Rdata, .RDS, .feather, etc.)

Here is an example:

library(doParallel)
library(doSNOW)

getDoParWorkers()
getDoParName()

cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

getDoParWorkers()
getDoParName()

# define the environment
.someEnv <- new.env(parent = emptyenv())
.someEnv$var <- 1:10
.someEnv$squared <- matrix(nrow = 10)

# define the function for "foreach"
do.something <- function(x)
{
  .someEnv$squared[x] <- .someEnv$var[x] * .someEnv$var[x]
  return(.someEnv$squared[x])
}

foreach(i = 1:10) %dopar% do.something(i)

stopCluster(cl)

Error Message:

Error in do.something(i) : 
  task 1 failed - "object '.someEnv' not found"
  • If you want to cache something that is common between parallel sessions, you need something on disk, I'm afraid. Or a more complicated approach based on message passing. – F. Privé Jun 28 '18 at 20:45

1 Answer

First of all, doSNOW and doParallel are two different packages that provide backends for foreach. You can certainly test both, just don't get confused. The following works for either, but your question is more closely related to the usage of the parallel package (which is included with R).
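
For example, you would register exactly one backend for a given cluster (a minimal sketch; getDoParName() just reports which backend is currently active):

library(foreach)

cl <- parallel::makeCluster(2L)
doParallel::registerDoParallel(cl)   # backend from doParallel
# doSNOW::registerDoSNOW(cl)         # or, alternatively, the backend from doSNOW
getDoParName()
parallel::stopCluster(cl)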

Your approach will not work because even if you put your environment in each parallel worker, they will be copies that are modified independently:

foreach(i = 1L:2L) %dopar% {
    .someEnv$hello <- "world"
    .someEnv$hello
}
[[1]]
[1] "world"

[[2]]
[1] "world"

print(.someEnv$hello)
NULL

However, you can use the bigmemory package:

library(doParallel)
library(bigmemory)

cl <- makeCluster(4L)
registerDoParallel(cl)

var <- 1L:10L

squared <- big.matrix(nrow = 10L, ncol = 1L, type = "integer")
# show by coercing to normal matrix
squared[,]
[1] NA NA NA NA NA NA NA NA NA NA

squared_desc <- describe(squared)
# assign it to each worker's global environment
clusterExport(cl, c("squared_desc"))

foreach(i = 1L:10L,
        .noexport = c("squared_desc"),
        .packages = "bigmemory") %dopar%
        {
          squared <- attach.big.matrix(squared_desc)
          squared[i] <- var[i] * var[i]
          NULL
        }

stopCluster(cl); registerDoSEQ()

squared[,]
[1]   1   4   9  16  25  36  49  64  81 100

Note that the matrices from bigmemory are strictly typed internally, so if you define them to be integers, you should assign integer values to them. Integers are explicitly specified in R by appending L to a number (e.g. 1L); otherwise you'll get a warning about downcasting.
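
A minimal sketch of that behavior (the exact warning text may vary between bigmemory versions):

library(bigmemory)

m <- big.matrix(nrow = 2L, ncol = 1L, type = "integer")
m[1] <- 5L    # fine: integer literal matches the matrix type
m[2] <- 5.5   # warns about downcasting from double; 5 is stored
m[, ]
[1] 5 5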

Also, you don't need to coerce the whole big.matrix in order to use it, but be aware that almost every time you access its elements, you are copying some data into a regular R matrix/vector.
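
For instance, reusing the squared matrix from above, each subset copies the requested elements into an ordinary R object:

first_three <- squared[1:3]   # copies 3 elements into a plain R integer vector
class(first_three)
[1] "integer"
squared[,]                    # copies all 10 elements; cheap here, costly for big data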

EDIT: And finally, I think bigmemory doesn't provide any synchronization mechanisms to protect against race conditions.
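
The loop above is only safe because each task writes to a distinct index i. As an illustration of what to avoid, a shared counter that every task increments is a classic read-modify-write race (a sketch; the final count is timing-dependent, so a given run may or may not actually lose updates):

library(doParallel)
library(bigmemory)

cl <- makeCluster(4L)
registerDoParallel(cl)

# a single shared cell that every task increments
counter <- big.matrix(nrow = 1L, ncol = 1L, type = "integer", init = 0L)
counter_desc <- describe(counter)
clusterExport(cl, "counter_desc")

foreach(i = 1L:1000L,
        .noexport = "counter_desc",
        .packages = "bigmemory") %dopar%
        {
          counter <- attach.big.matrix(counter_desc)
          # not atomic: two workers can read the same value,
          # and one of the two increments is then lost
          counter[1] <- counter[1] + 1L
          NULL
        }

counter[1]  # may well be less than 1000

stopCluster(cl); registerDoSEQ()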

Alexis
  • Slightly off-topic, but also take into account any overhead you can remove by [chunking](https://stackoverflow.com/questions/50801635/doparallel-performance-on-a-tensor-in-r/50804071#50804071). – Alexis Jun 28 '18 at 21:20
  • I really appreciate your answer! I would like to shoot you a few more cases, if that's fine with you. Case 1: `big.matrix()` accepts the types "integer", "double" and "character". Is there another method where I can use different types (like `list()` or `new.env()`)? Is `type = ""` limited to "C"-code variables? Case 2: Is there a way to create shared memory inside of the `foreach()` loop for the cores? Or do I need to declare all the shared memory when creating the clusters? – Andy Parzyszek Jun 29 '18 at 06:51
  • I don't think you can use more types with `big.matrix`, although I suppose you could have a list of big matrices, but I don't think that can work for environments. You can create big matrices inside each worker (see e.g. `clusterEvalQ()`), but if it's supposed to be shared, why would you create it inside instead of in the main process? – Alexis Jun 29 '18 at 07:36
  • And if you do create it inside the workers, you'd have to return the description to the main process. – Alexis Jun 29 '18 at 07:52
  • Thanks Alexis! The question was a general one - like I mentioned, I would like to understand the package better. I have one last question: what would be the equivalent of your code for Linux? Is `doParallel` for Windows and `doMC` for Linux? And do I need to pass the variable using something similar to `clusterExport()`? – Andy Parzyszek Jul 03 '18 at 08:24
  • `doParallel` works on both Windows and Linux, I don't think you'd have to change anything. – Alexis Jul 11 '18 at 17:45