I am relatively new to parallel processing in R. I have been playing around with some code and ran into a problem: the code inside the foreach loop does not seem to modify certain variables / data frames, although it does produce the predictions (the aim of the program). My code is as follows:

library(parallel)
library(doParallel)
library(foreach)
library(iterators)

# select people who visited and who did not visit separately
vis <- b[which(b$visit==1),]
no <- not <- b[which(b$visit!=1),]

# create parallel processing environment for 2 (TWO) processors
cl<-makeCluster(2)
registerDoParallel(cl)

iterations <- 3
#k=f=1
predictions <- foreach(icount(iterations), .combine=cbind) %dopar% {

   # randomly select training & testing set from Visited customers
   pos <- sample(nrow(vis),size=floor(nrow(vis)/10*7),replace=FALSE)
   train<- vis[pos,]
   test<- vis[-pos,]

   # create distinct and non-repeatable bags for Non-visited customers
   sel <- sample(nrow(not), size=9246, replace=FALSE)
   #train1 <-1:nrow(not) %in% sel
   no <- not[sel,]
   not <- not[-sel,]

   # randomly select training & testing set from Non-Visited customers
   pos1 <- sample(nrow(no),size=floor((nrow(no)/10)*7),replace=FALSE)
   trainNo <- no[pos1,]
   testNo <- no[-pos1,]

   # combine the train & test Bags of both Visit & Non-Visit customers
   trainSet <- rbind(train,trainNo)
   testSet <- rbind(test,testNo)

   fit <- glm(visit~., data=trainSet, family=binomial(logit))

   #pr <- 
   print(nrow(not))  # nrow() gives the row count; length() on a data frame returns the number of columns
   predict(fit,testSet[,-10])
   #pr <- rbind(pr,predict(fit,testSet[,-10]))
 }
pred <- rowMeans(predictions)
stopCluster(cl)

The problems I am facing are:

  1. The 'not' data frame remains the same size after the foreach loop (it should shrink on each iteration by the rows selected in 'sel').

  2. None of the variables created within the foreach loop exist after running it. Why is this happening?

I cannot work out where I have gone wrong, and would greatly appreciate it if someone could tell me where I have erred.

P.S. Some background on the problem at hand: I am trying to create bags with a balanced distribution (a roughly equal number of visit [1] and non-visit [0] records) for classification purposes.

r2evans
vsdaking
  • The problem you're running into is because the whole environment -- including your global variables -- is not (and will not be) available to the other *R* sessions; you can make some things available to them, but it provides copies, not a reference to the original. You can find some starting pointers at [SO #11583007](http://stackoverflow.com/questions/11583007/communication-of-parallel-processes-what-are-my-options), but the short of it is that you may not be able to easily affect the parent environment from a child process without considerable effort. – r2evans Aug 10 '15 at 08:33
  • Thanks for the prompt reply @r2evans. That is quite insightful; at least I know that my program logic is not wrong per se. However, can you please help me understand (or provide a link) why this happens? Since each session (on each core of the same PC) is being called from the global environment, should the variables not be sent as well? Also, what do you think the workaround / solution could be in such a situation? – vsdaking Aug 11 '15 at 07:37
  • Also, @r2evans, do you know of any way to implement bagging in R such that we get a roughly equal distribution of elements (the split is on a binary-valued attribute, with one value far more frequent than the other)? Thanks – vsdaking Aug 11 '15 at 07:47
  • There's a difference between "having access to a copy of the global variables" (which child processes do have) and "being able to change the value of global variables in another process (parent or another child)". Somewhat analogous to forking, you are seeing a copy but do not have the ability to update the value in others. Workarounds: shared memory or pipes would be my best guess, but perhaps there's a way to change the logic such that the child processes don't need to update the global data frame? – r2evans Aug 11 '15 at 14:14
  • Bagging: search for `R binning` and you'll get answers such as [SO #24359863](http://stackoverflow.com/questions/24359863/binning-data-in-r). – r2evans Aug 11 '15 at 14:21

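One way to follow the suggestion of changing the logic so that the workers never need to update a shared data frame is to draw all the non-overlapping bags in the master session before the loop starts, and hand each iteration its own set of row indices. The sketch below is an illustration under stated assumptions, not the asker's final method: the data frame `b` is simulated, `bag_size` is a hypothetical replacement for the hard-coded 9246, and for brevity each model is scored on the full data set rather than on a separate test split.

```r
library(doParallel)  # also attaches foreach, iterators and parallel

set.seed(42)
# Simulated stand-in for the data frame `b` from the question (assumption)
b <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000),
                visit = rbinom(1000, 1, 0.1))
vis <- b[b$visit == 1, ]
not <- b[b$visit != 1, ]

iterations <- 3
# Hypothetical bag size: match the minority class, capped by the rows available
bag_size <- min(nrow(vis), floor(nrow(not) / iterations))

# Shuffle row indices once, then cut them into disjoint chunks.
# Sampling without replacement guarantees that the bags never overlap.
shuffled <- sample(nrow(not), iterations * bag_size)
bags <- split(shuffled, rep(seq_len(iterations), each = bag_size))

cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration receives one index vector; nothing shared is modified
preds <- foreach(idx = bags, .combine = cbind) %dopar% {
  trainSet <- rbind(vis, not[idx, ])
  fit <- glm(visit ~ ., data = trainSet, family = binomial(logit))
  predict(fit, newdata = b, type = "response")
}

stopCluster(cl)

pred <- rowMeans(preds)  # average the predicted probabilities over the bags
```

Because the partitioning happens before `foreach` is called, the iterations are independent of each other and can run in any order on any worker. Anything else needed after the loop (fitted coefficients, the indices used, and so on) would have to be returned from the loop body, for example as a list, since variables created on the workers are not copied back.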