I am relatively new to parallel processing in R. I have been experimenting with some code and ran into a problem: the code inside the foreach loop does not seem to modify certain variables / data frames, although it does produce the predictions (which is the aim of the program). My code is as follows:
library(parallel)
library(doParallel)
library(foreach)
library(iterators)
# select people who visited and who did not visit separately
vis <- b[which(b$visit==1),]
no <- not <- b[which(b$visit!=1),]
# create parallel processing environment for 2 (TWO) processors
cl<-makeCluster(2)
registerDoParallel(cl)
iterations <- 3
#k=f=1
predictions <- foreach(icount(iterations), .combine=cbind) %dopar% {
  # randomly select training & testing set from Visited customers (70/30 split)
  pos <- sample(nrow(vis), size=floor(nrow(vis)/10*7), replace=FALSE)
  train <- vis[pos,]
  test <- vis[-pos,]
  # create distinct and non-repeatable bags for Non-visited customers
  sel <- sample(nrow(not), size=9246, replace=FALSE)
  #train1 <-1:nrow(not) %in% sel
  no <- not[sel,]
  # intended to shrink 'not' so that later iterations cannot pick these rows again
  not <- not[-sel,]
  # randomly select training & testing set from Non-Visited customers (70/30 split)
  pos1 <- sample(nrow(no), size=floor((nrow(no)/10)*7), replace=FALSE)
  trainNo <- no[pos1,]
  testNo <- no[-pos1,]
  # combine the train & test Bags of both Visit & Non-Visit customers
  trainSet <- rbind(train, trainNo)
  testSet <- rbind(test, testNo)
  fit <- glm(visit~., data=trainSet, family=binomial(logit))
  #pr <-
  # rows left in 'not' (length() on a data frame returns the column count)
  print(nrow(not))
  # the last expression is the value returned for this iteration
  predict(fit, testSet[,-10])
  #pr <- rbind(pr,predict(fit,testSet[,-10]))
}
pred <- rowMeans(predictions)
stopCluster(cl)
The problems I am facing are:
the 'not' data frame remains the same size even after the foreach loop (it should shrink with each iteration by 'sel', i.e. the selected records).
none of the variables created within the foreach loop appear to exist after running it. Why is this happening?
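To make both symptoms easier to reproduce without my data, here is a minimal sketch. The variables `x` and `y` are made up purely for illustration (`x` stands in for the 'not' data frame), and I am assuming the same 2-worker cluster setup as above.

```r
library(doParallel)  # also attaches foreach, iterators and parallel

cl <- makeCluster(2)
registerDoParallel(cl)

x <- 10  # toy variable (hypothetical), standing in for the 'not' data frame

res <- foreach(i = 1:3, .combine = c) %dopar% {
  x <- x - 1   # assignment made inside the loop body, on a worker
  y <- i * 2   # variable created only inside the loop body
  x            # last expression: the value returned for this iteration
}

stopCluster(cl)

print(res)          # one returned value per iteration
print(x)            # x in my session is unchanged by the loop
print(exists("y"))  # y is not visible in my session afterwards
```

Only the values returned by each iteration (collected in `res`) seem to make it back to my session, which matches what I see with `predictions` in the real code.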
I cannot work out where I have gone wrong, and would greatly appreciate it if someone could point out my mistake.
P.S. Some background on the problem at hand: I am trying to create bags with a roughly equal distribution (a relatively equal number of visit [1] and non-visit [0] records) for classification purposes.
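To illustrate what I mean by distinct, non-repeating bags, here is a rough sketch on toy data. The data frame `not_toy`, its `id` column and the sizes are all invented for the example (`bag_size` stands in for the 9246 used in my real code); it only shows the outcome I am after, not necessarily how it should be done inside foreach.

```r
set.seed(1)  # only so that the toy example is repeatable

# toy stand-in for the non-visited rows; names and sizes are hypothetical
not_toy <- data.frame(id = 1:30, visit = 0)
iterations <- 3
bag_size <- 10  # stands in for 9246 in the real code

# shuffle the row indices once, then cut them into non-overlapping bags
shuffled <- sample(nrow(not_toy))
bags <- split(shuffled[seq_len(iterations * bag_size)],
              rep(seq_len(iterations), each = bag_size))

# iteration i would then use not_toy[bags[[i]], ] as its 'no' bag
sizes <- vapply(bags, length, integer(1))
overlap <- length(intersect(bags[[1]], bags[[2]]))
print(sizes)    # number of rows in each bag
print(overlap)  # number of rows shared by the first two bags
```

Each bag of non-visited rows would then be combined with the visited rows, so that every model sees a roughly balanced set and no non-visited record is used in more than one bag.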