
I have R code that I need to get to a "parallelization" stage. I'm new at this, so please forgive me if I use the wrong terms. I have a process that has to chug through individual by individual, one at a time, and then average across individuals at the end. The process is exactly the same for each individual (it's a Brownian bridge); I just have to do this for >300 individuals. So I was hoping someone here might know how to change my code so that it can be spawned? or parallelized? or whatever the word is, so that the 48 CPUs I now have access to can help reduce the 58 days it will take to compute this on my little laptop. In my head I would just send one individual out to one processor, have it run through the script, and then send out another one... if that makes sense.

Below is my code. I have tried to comment in it and have indicated where I think the code needs to be changed.

for (n in 1:(length(IDNames))){ #THIS PROCESSES THROUGH EACH INDIVIDUAL 

#THIS FIRST PART IS JUST EXTRACTING THE DATA FROM MY TWO INPUT FILES.  
#I HAVE ONE FILE WITH ALL THE LOCATIONS AND THEN ANOTHER FILE WITH A DATE RANGE.  
#EACH INDIVIDUAL HAS DIFFERENT DATE RANGES, THUS IT HAS TO PULL OUT EACH INDIVIDUALS 
#DATA SET SEPARATELY AND THEN RUN THE FUNCTION ON IT.

    IndivData = MovData[MovData$ID==IDNames[n],]
    IndivData = IndivData[1:(nrow(IndivData)-1),]
    if (UseTimeWindow==T){
      IndivDates = dates[dates$ID==IDNames[n],]
      IndivData = IndivData[IndivData$DateTime>IndivDates$Start[1]&IndivData$DateTime<IndivDates$End[1],]
    }
    IndivData$TimeDif[nrow(IndivData)]=NA

    ########################
#THIS IS THE PROCESS THAT I THINK EACH INDIVIDUAL NEEDS TO BE RUN THROUGH

    BBMM <- brownian.bridge(x=IndivData$x, y=IndivData$y,
    time.lag = IndivData$TimeDif[1:(nrow(IndivData)-1)], location.error=20,
    area.grid = Grid, time.step = 0.1)

  #############################
  # BELOW IS JUST CODE TO BIND THE RESULTS INTO A GRID DATA FRAME I ALREADY CREATED.  
  #I DO NOT UNDERSTAND HOW THE MULTICORE PROCESSED CODE WOULD JOIN THE DATA BACK 
  #WHICH IS WHY IVE INCLUDED THIS PART OF THE CODE.  

    if(n==1){   #creating a data frame with the x, y, and probabilities for the first individual
      BBMMProbGrid = as.data.frame(1:length(BBMM[[2]]))
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[2]],BBMM[[3]],BBMM[[4]])
      colnames(BBMMProbGrid)=c("GrdId","X","Y",paste(IDNames[n],"_Prob", sep=""))
    } else {                #For every other individual just add the new information to the dataframe
      BBMMProbGrid = cbind(BBMMProbGrid,BBMM[[4]])
      colnames(BBMMProbGrid)[n+3]=paste(IDNames[n],"_Prob", sep ="") #column n+3 holds individual n's probabilities
    }# end if  


    } #end loop through individuals
Kerry
    I don't know what someone voted -1 for, but I suspect because your code is too complicated -- it will take people a long time to sift through it. Can you give us a simplified version that's only 10-20 lines long, still has complete R syntax, but gives the idea of what you want to do? Also, can you tell us a little bit more about your computational setup -- multicores, closely coupled machines, ... ? What approaches have you thought of (see the high performance task view at http://cran.r-project.org/web/views/HighPerformanceComputing.html ) – Ben Bolker Aug 17 '11 at 01:53
  • Seconding Brian. Don't make us do all of your work. Just show us the steps you need to parallelize and we can help you with it. – Maiasaura Aug 17 '11 at 02:21
  • Oops, I meant to type Ben. Sorry! – Maiasaura Aug 17 '11 at 03:18
  • 2
    See also the guidelines about a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Joris Meys Aug 17 '11 at 09:03
  • @Joris Meys - thank you that is an excellent link for someone like me who has been struggling with how to write out my questions for the most optimal help. I will mark that discussion for future reference and hope I improve so I can make the best of this community – Kerry Aug 17 '11 at 15:50
  • @Ben - Thanks for the comment. I appreciate the constructive help. I apologize for too much code. I just didn't know what would be optimal and obviously I over did it. As for the setup - all Ive been told is that it is a multicore setup. I checked your link and have realized this is probably too much to ask of the community to help with - thus the votes towards negative...my mistake! I do not know enough key things it seems. – Kerry Aug 17 '11 at 16:08

3 Answers


Not sure why this has been voted down either. I think the foreach package is what you're after. Its first few PDF vignettes have very clear, useful information in them. Basically, write what you want done for each person as a function. Then use foreach to send one person's data out to a node to run the function (while sending another person's to another node, etc.), and it then compiles all the results using something like rbind. I've used this a few times with great results.

Edit: I didn't try to rework your code, as I figure that, given you've got this far, you'll easily have the skills to wrap it into a function and then use the one-liner foreach.

Edit 2: This reply was too long for the comments field.

I thought, since you had got that far with the code, that you would be able to get it into a function :) If you're still working on this, it might help to think of writing a for loop that loops over your subjects and does the calculations required for each subject. The body of that loop is then what you want in your function; in your code, I think that is everything down to area.grid. You can then get rid of most of your [n]'s, since the data are only subset once per iteration.

Perhaps:

pernode <- function(i) {
    # i indexes into IDNames; MovData, dates, IDNames, Grid and
    # UseTimeWindow are assumed to be visible to the workers
    IndivData = MovData[MovData$ID == IDNames[i], ]
    IndivData = IndivData[1:(nrow(IndivData) - 1), ]
    if (UseTimeWindow == TRUE) {
        IndivDates = dates[dates$ID == IDNames[i], ]
        IndivData = IndivData[IndivData$DateTime > IndivDates$Start[1] &
                              IndivData$DateTime < IndivDates$End[1], ]
    }
    IndivData$TimeDif[nrow(IndivData)] = NA

    BBMM <- brownian.bridge(x = IndivData$x, y = IndivData$y,
                            time.lag = IndivData$TimeDif[1:(nrow(IndivData) - 1)],
                            location.error = 20,
                            area.grid = Grid, time.step = 0.1)

    return(BBMM)
}

Then something like:

library(doMC)
library(foreach)
registerDoMC(cores=48) # or perhaps a few less than all you have

system.time(
  output <- foreach(i = 1:length(IDNames), .combine = "rbind",
                    .multicombine = TRUE, .inorder = FALSE) %dopar% pernode(i)
)

Hard to say whether that's it without some test data; let me know how you get on.
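If re-assembling the per-individual results into your grid data frame is the sticking point, here is one untested sketch: leave foreach's default list combine so each BBMM object comes back whole, then cbind the probability columns. This assumes, as in your question's code, that elements 2-4 of each BBMM object are x, y, and the probabilities.

```r
# Untested sketch: collect whole BBMM objects in order, then rebuild
# BBMMProbGrid exactly as the serial loop did.
output <- foreach(i = 1:length(IDNames), .inorder = TRUE) %dopar% pernode(i)

# Assumes output[[i]][[2]], [[3]], [[4]] are x, y and probability.
BBMMProbGrid <- data.frame(GrdId = 1:length(output[[1]][[2]]),
                           X = output[[1]][[2]],
                           Y = output[[1]][[3]])
for (i in 1:length(output)) {
  BBMMProbGrid[paste(IDNames[i], "_Prob", sep = "")] <- output[[i]][[4]]
}
```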

nzcoops
  • @nzcoops - aah, you have a lot of faith in me, brother, to think I've got the right skillzzz! Thanks. It has been a struggle using other people's scripts and examples to produce what I've currently got. However, I will persevere. I have never used the lapply command, and this is the first I've heard of foreach. – Kerry Aug 17 '11 at 15:45
  • extended the answer as I ran out of space in the comments field. – nzcoops Aug 17 '11 at 23:52

This is a general example, since I didn't have the patience to read through all of your code. One of the quickest ways to spread this across multiple processors is to use the multicore library and mclapply (a parallelized version of lapply) to push a list through a function; in your case, the individual items in the list would be data frames for each of your 300+ individuals.

Example:

library(multicore)
result <- mclapply(data_list, your_function, mc.preschedule = FALSE, mc.set.seed = FALSE)
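
If building `data_list` is the unclear part, it can be created by splitting the movement data frame on the ID column (a sketch using the names from the question):

```r
# One data frame per individual, named by ID; each element is then
# passed as the single argument to your_function.
data_list <- split(MovData, MovData$ID)
```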
Maiasaura
  • Thank you for your suggestion, I have cut down the code to the section I believe would be parsed out to the various processors. There is just 1 major function in the code (the brownian.bridge command). This is what needs to be calculated for each individual in my data set. Where I start with the IF statement is where I am taking the results and binding them into a Grid. I assume this is unnecessary to send out to a cluster but I don't understand how the process would know to come back and join together from all the different processors. – Kerry Aug 17 '11 at 15:59
  • 1
    For loops are inefficient in R. Throw the data for each individual into a list using plyr. example: individuals=dlply(original_data,.(IDNames)). Next, create a wrapper function (pass one list item at a time to that function) that includes both the bbbm function and formatting the results. So when multicore runs, it will do both and return a list where the individual items are the results bound to grid. Then you can do whatever (eg. collapse the list, plot) etc. – Maiasaura Aug 17 '11 at 16:25
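
The wrapper-function approach described in the comment above might look something like this (an untested, illustrative sketch; the returned columns assume elements 2-4 of the BBMM object are x, y, and probability, as in the question):

```r
# Illustrative sketch of the dlply + wrapper approach.
library(plyr)
library(multicore)

individuals <- dlply(original_data, .(ID))  # one data frame per individual

wrapper <- function(IndivData) {
  BBMM <- brownian.bridge(x = IndivData$x, y = IndivData$y,
                          time.lag = IndivData$TimeDif[1:(nrow(IndivData) - 1)],
                          location.error = 20, area.grid = Grid, time.step = 0.1)
  # format the result here, e.g. keep just coordinates and probabilities
  data.frame(X = BBMM[[2]], Y = BBMM[[3]], Prob = BBMM[[4]])
}

results <- mclapply(individuals, wrapper, mc.preschedule = FALSE)
```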

As I understand your description, you have access to a distributed computing cluster, so the multicore package will not work; you have to use Rmpi, snow, or foreach. Based on your existing loop structure, I would advise using the foreach and doSNOW packages. But it looks like you have a lot of data, so you should probably check whether you can reduce what is sent to the nodes to only the data each one requires.
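
A minimal sketch of that setup (untested; the node count and cluster type are placeholders, and the per-individual data are split beforehand so each node only receives the rows it needs):

```r
# Untested sketch: foreach over a snow cluster, iterating over
# pre-split per-individual data frames.
library(foreach)
library(doSNOW)

cl <- makeCluster(48, type = "SOCK")  # or a vector of node hostnames
registerDoSNOW(cl)

per_indiv <- split(MovData, MovData$ID)
output <- foreach(IndivData = per_indiv, .packages = "BBMM") %dopar% {
  brownian.bridge(x = IndivData$x, y = IndivData$y,
                  time.lag = IndivData$TimeDif[1:(nrow(IndivData) - 1)],
                  location.error = 20, area.grid = Grid, time.step = 0.1)
}

stopCluster(cl)
```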