Let's suppose that I want to apply myfunction, in parallel, to each row of myDataFrame. Suppose that otherDataFrame is a data frame with two columns, COLUMN1_odf and COLUMN2_odf, that myfunction uses. I would like to write code using parApply like this:

clus <- makeCluster(4)
clusterExport(clus, list("myfunction","%>%"))

myfunction <- function(fst, snd) {
 #otherFunction and aGlobalDataFrame are defined in the global env
 otherFunction(aGlobalDataFrame)

 # some code to create otherDataFrame **INTERNALLY** to this function
 otherDataFrame %>% filter(COLUMN1_odf==fst & COLUMN2_odf==snd)
 return(otherDataFrame)
}
do.call(bind_rows, parApply(clus, myDataFrame, 1, function(r) { myfunction(r[1], r[2]) }))

The problem here is that R doesn't recognize COLUMN1_odf and COLUMN2_odf, even if I add them to clusterExport. How can I solve this problem? Is there a way to "export" all the objects that snow needs, so that I don't have to enumerate each of them?

EDIT 1: I've added a comment (in the code above) to specify that otherDataFrame is created internally to myfunction.

EDIT 2: I've added some pseudo-code to generalize myfunction: it now uses a global data frame (aGlobalDataFrame) and another function (otherFunction).

enneppi
  • Arguments of your `myFunction` should hold all objects. Try `myFunction <- function(otherDataFrame, fst, snd) {...`. – Roman Luštrik Oct 23 '16 at 07:12
  • In reality otherDataFrame is an object created in myfunction, so I cannot pass it in. – enneppi Oct 23 '16 at 09:38
  • Rather than using `parApply`, you can use `parLapply`: `do.call(bind_rows, parLapply(clus, seq_len(nrow(myDataFrame)), function(i, r) { myfunction(r[i, 1], r[i, 2]) }, r = myDataFrame))`. (I haven't tested this. It may still need some fiddling.) – Benjamin Oct 23 '16 at 10:48
  • My point still stands: pass all needed objects to the function without referring to them from outside. Outside is empty once you spawn extra R parallel processes. – Roman Luštrik Oct 23 '16 at 11:14
  • But it is inside the function that I create otherDataFrame, so I cannot pass it to the function. otherDataFrame is an object created inside the function using the parameters passed to the function itself. – enneppi Oct 23 '16 at 14:10
  • @Benjamin I'll check your suggestion asap. But why should parLapply work? What is different about parLapply in this particular case? – enneppi Oct 23 '16 at 14:15
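
The point made in the comments, that each worker starts with an empty workspace, can be checked directly. This is a minimal sketch using the base `parallel` package; the object name `threshold` is hypothetical and only serves as something that exists in the master session.

```r
library(parallel)

cl <- makeCluster(2)

threshold <- 10  # hypothetical object that lives only in the master session

# Each worker is a fresh R process, so `threshold` is not defined there yet
before <- unlist(clusterEvalQ(cl, exists("threshold")))

# Copy the object to every worker by name
clusterExport(cl, "threshold", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("threshold")))

stopCluster(cl)

print(before)  # FALSE FALSE
print(after)   # TRUE TRUE
```

Anything a worker function looks up by name (data frames, helper functions, packages) has to be sent or loaded in this way before the parallel call.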

2 Answers

After some experiments, I solved my problem (following Benjamin's suggestion and taking into account the edits I added to the question) with:

clus <- makeCluster(4)
# Load the packages on every worker so that %>% and filter() are available there
clusterEvalQ(clus, {library(dplyr); library(magrittr)})

myfunction <- function(fst, snd) {
 # otherFunction and aGlobalDataFrame are defined in the global env
 otherFunction(aGlobalDataFrame)

 # some code to create otherDataFrame **INTERNALLY** to this function
 # keep the filtered result, otherwise the unfiltered data frame is returned
 otherDataFrame <- otherDataFrame %>% dplyr::filter(COLUMN1_odf == fst & COLUMN2_odf == snd)
 return(otherDataFrame)
}

# Export by name, as a character vector, after the objects have been defined
clusterExport(clus, c("myfunction", "otherFunction", "aGlobalDataFrame"))

do.call(bind_rows, parApply(clus, myDataFrame, 1,
        function(r) { myfunction(r[1], r[2]) }))

In this way I've registered aGlobalDataFrame, myfunction and otherFunction on the workers: in short, all the functions and data used by the function that is run in parallel (myfunction itself).
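
On the question of avoiding the enumeration: one option is to hand `clusterExport` the result of `ls()`, which exports everything in an environment by name. The sketch below assumes toy stand-ins for aGlobalDataFrame, otherFunction and myfunction (their bodies are hypothetical and not the ones from the question), and it leaves the cluster handle itself out of the export.

```r
library(parallel)

# Hypothetical stand-ins for the objects in the question
aGlobalDataFrame <- data.frame(key = c(1, 2, 3), value = c(10, 20, 30))
otherFunction <- function(df, k) df$value[df$key == k]
myfunction <- function(k) otherFunction(aGlobalDataFrame, k) * 2

cl <- makeCluster(2)

# Export every object of the current environment except the cluster handle
env <- environment()
clusterExport(cl, varlist = setdiff(ls(env), c("cl", "env")), envir = env)

result <- unlist(parLapply(cl, c(1, 2, 3), function(k) myfunction(k)))
stopCluster(cl)

print(result)  # 20 40 60
```

This is convenient, but it copies every object to every worker, so with large workspaces an explicit list of names is probably the cheaper choice.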

enneppi

Now that I'm not looking at this on my phone, I can see a couple of issues.

First, you are not actually creating otherDataFrame in your function. You are trying to pipe an existing otherDataFrame into filter, and if otherDataFrame doesn't exist in the environment, the function will fail.

Second, unless you have already loaded the dplyr package into your cluster environments, you will be calling the wrong filter function (stats::filter instead of dplyr::filter).
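
A quick way to see which `filter` a worker would pick up is to ask for the environment the function comes from. This is only a sketch: it assumes a default worker setup where the stats package is attached, and the dplyr part is guarded in case dplyr is not installed where it is run.

```r
library(parallel)

cl <- makeCluster(1)

# On a fresh worker, `filter` resolves to the stats package (a time-series filter)
fresh <- clusterEvalQ(cl, environmentName(environment(filter)))[[1]]

# After attaching dplyr on the worker, `filter` resolves to dplyr instead
attached <- clusterEvalQ(cl, {
  if (requireNamespace("dplyr", quietly = TRUE)) {
    library(dplyr)
    environmentName(environment(filter))
  } else {
    NA_character_  # dplyr not available on this machine
  }
})[[1]]

stopCluster(cl)

print(fresh)     # "stats"
print(attached)
```

Writing `dplyr::filter` explicitly inside the function sidesteps the masking question, as long as dplyr is installed on the workers.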

Lastly, when you've called parApply, you haven't specified anywhere what fst and snd are supposed to be. Give the following a try:

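clus <- makeCluster(4)
clusterEvalQ(clus, {library(dplyr); library(magrittr)})

myfunction <- function(otherDataFrame, fst, snd) {
 dplyr::filter(otherDataFrame, COLUMN1_odf == fst & COLUMN2_odf == snd)
}

# Export after the definitions exist; otherDataFrame is assumed to exist in the global env
clusterExport(clus, c("myfunction", "otherDataFrame"))

# "[fst]" and "[snd]" are placeholders for the names of the two columns of myDataFrame
do.call(bind_rows, parApply(clus, myDataFrame, 1,
        function(r, fst, snd) { myfunction(otherDataFrame, r[fst], r[snd]) },
        fst = "[fst]", snd = "[snd]"))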
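
For reference, here is a small self-contained sketch of how extra arguments reach the row function through parApply's `...`. The data frame and its column names `a` and `b` are hypothetical toy values, not the ones from the question.

```r
library(parallel)

# Hypothetical toy data standing in for myDataFrame
myDataFrame <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

cl <- makeCluster(2)

# `fst` and `snd` are forwarded to every call through parApply's `...`;
# each row arrives as a named vector, so it can be indexed by column name
sums <- parApply(cl, myDataFrame, 1,
                 function(r, fst, snd) unname(r[fst] + r[snd]),
                 fst = "a", snd = "b")
stopCluster(cl)

print(as.numeric(sums))  # 11 22 33
```

Because the row function receives everything it needs as arguments, nothing has to be exported to the workers for this call.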
Benjamin
  • I'll try it asap. Even though otherDataFrame is created internally to the function (read my last edit), your suggestion is a good approach to follow. – enneppi Oct 24 '16 at 11:14