
I am trying to use parLapply to run a custom function. Since my actual code and data are not very reader-friendly, I have written pseudo code for reference. I do the following:

a) First, I create a custom function. This function takes an argument, say "Argument1". Argument1 is a list object, which is what I run parLapply over later.

b) Inside the function, based on Argument1, I create a subset called subset_data (by subsetting the full dataset, which is supplied when calling parLapply).

c) After getting subset_data, I obtain the unique values of Variable2 and then subset the data further for each of those values.

d) Finally, I run a function (SomeOtherFunction) that takes subset_data2 as its argument.

SomeCustomFunction = function(Argument1){
   # Subset the full data for the current value of Variable1
   subset_data = OriginalData[which(OriginalData$Variable1 == Argument1), ]
   
   # Unique values of Variable2 within this subset
   some_other_variable = unique(subset_data$Variable2)
   
   FinalOutput = list()
   for (object in some_other_variable){
      subset_data2 = subset_data[which(subset_data$Variable2 == object), ]
      
      # Run the second function on each sub-subset and collect the results
      FinalOutput[[as.character(object)]] = SomeOtherFunction(subset_data2)
   }
   return(FinalOutput)
}

SomeOtherFunction = function(subset_data2){
   # Do some computation here
}

Next, I create a cluster in this way:

library(doParallel)   # provides registerDoParallel()

cl = parallel::makeCluster(2, type = "PSOCK")
registerDoParallel(cl)

I then supply the objects Argument1 and OriginalData by calling clusterExport, and finally run parLapply, supplying SomeCustomFunction and a list for Argument1 (say Argument1_list).

clusterExport(cl=cl, list("Argument1","OriginalData"),envir=environment())
zz=parLapply(cl=cl,fun=SomeCustomFunction,Argument1=Argument1_list)

However, when I run parLapply in this way, I get the following error:

Error in get(name, envir = envir) : object 'subset_data2' not found

I assumed that since subset_data2 is created within the first function, the object subset_data2 would be supplied automatically. Clearly this is not happening.

Is there a way for me to supply this second subset (subset_data2) within the function SomeCustomFunction without passing it to the cluster when calling clusterExport?

If the question is not clear, please let me know and I can modify it accordingly. Thanks in advance.

P.S. I read this question: using parallel's parLapply: unable to access variables within parallel code, but in my case I do not call parLapply inside my function.

Prometheus

1 Answer


In the related question you mention, the top answer passes clusterExport a character vector of variable names, whereas you pass a list. Also, help(clusterExport) reveals: "varlist: character vector of names of objects to export".
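
For illustration, here is a minimal sketch of what the corrected calls could look like, reusing the placeholder names from your pseudo code (OriginalData, SomeOtherFunction, SomeCustomFunction, Argument1_list). Note that Argument1 itself does not need to be exported, because parLapply passes each element of Argument1_list to the function, but any function called on the workers inside SomeCustomFunction (such as SomeOtherFunction) does:

library(parallel)

cl = makeCluster(2, type = "PSOCK")

# varlist takes a character vector of object names, not a list
clusterExport(cl, varlist = c("OriginalData", "SomeOtherFunction"), envir = environment())

# The list to iterate over is supplied as X; each element becomes Argument1
zz = parLapply(cl, X = Argument1_list, fun = SomeCustomFunction)

stopCluster(cl)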

Also, you're missing a " after Argument1 here: list("Argument1,"OriginalData, but I'm guessing that's only the sample code you posted, not in your real code.

PS: It's a step in the right direction that you included some code, but your question will get more responses if you include sample data and code that can be pasted and run directly to reproduce the error.

webb
  • Good catch. I made the edit. Thanks. In my case, I am trying to parallelize the computations for each element of the list that I pass. This approach works when I do not call the function from within the parallel version of the code. Originally, I had the contents of "SomeOtherFunction" inside the body of the CustomFunction. This was very slow, as it has to make several comparisons (including subsetting, comparing row values, etc.). To increase the speed, I turned it into a separate function, and what I am having trouble with right now is passing the argument to this second function. – Prometheus Jun 25 '20 at 19:25