
I have the following R loop built around an `apply` statement:

for (i in 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation))
{
    # Select the simulation columns named in row i of the lookup table,
    # then sum across each simulation (row)
    matrix_of_sums[, i] <-
        apply(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]], 1, sum)
}

So, I have the following data structures:

simulation_results: A matrix with column names that identify every possible piece of desired simulation lookup data for 2000 simulations (rows).

dataframe_stuff_that_needs_lookup_from_simulation: Contains, among other items, fields whose values match the column names in the simulation_results data structure.

matrix_of_sums: When the code is run, a 2000-row x 250,000-column (number of simulations x number of items being simulated) matrix meant to hold the summed simulation results.

So, for each of the 250,000 rows in the lookup data frame, the `apply` call selects the simulation columns whose names match that row's values, computes the sum across each of the 2000 simulations, and stores the resulting column in matrix_of_sums.

Unfortunately, this processing takes a very long time. I have explored the use of `rowSums()` as an alternative, and it cut the processing time in half, but I would like to try multi-core processing to see if that cuts it further. Can someone help me convert the code above from `apply` to `lapply`?
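
For reference, here is a sketch of the `lapply` form I am aiming for, built on the `rowSums()` variant mentioned above (untested; the plan is to later swap `mclapply` in for `lapply`):

# Each list element holds the sums for one row of the lookup table;
# bind them into the 2000 x 250,000 result afterwards
list_of_sums <- lapply(1:nrow(dataframe_stuff_that_needs_lookup_from_simulation),
    function(i) {
        rowSums(simulation_results[, colnames(simulation_results) %in%
            dataframe_stuff_that_needs_lookup_from_simulation[i, ]])
    })
matrix_of_sums <- do.call(cbind, list_of_sums)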

Thanks!

Matthya C
    Going from `apply` to `lapply` isn't likely to make much of a difference. But it looks like you are just doing a row sum so look at `rowSums()`. Try looking at the [foreach](https://cran.r-project.org/package=foreach) package if you want to try parallelizing computation. – MrFlick Oct 02 '17 at 19:55
  • Why don't you share your solution with `rowSums`? – moodymudskipper Oct 02 '17 at 19:58
  • @MrFlick Hey, thanks for responding. Actually, I'm trying to convert to lapply so that I can then use the mclapply function, which processes with multiple cores. – Matthya C Oct 02 '17 at 20:05
  • @Moody_Mudskipper The rowSums version looks like this: matrix_of_sums[, i] <- rowSums(simulation_results[, colnames(simulation_results) %in% dataframe_stuff_that_needs_lookup_from_simulation[i, ]]) – Matthya C Oct 02 '17 at 20:07

2 Answers


With base R's parallel package, try:

library(parallel)
cl <- makeCluster(detectCores())
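# As noted in the comments below, a PSOCK cluster cannot see objects in the
# parent session; export the data the workers need first, or each node
# fails with "object 'simulation_results' not found":
clusterExport(cl, varlist = c("simulation_results",
                              "dataframe_stuff_that_needs_lookup_from_simulation"))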
matrix_of_sums <- parLapply(cl, 1:nrow(dataframe_stuff_that_needs_lookup_from_simulation), function(i)
    rowSums(simulation_results[,colnames(simulation_results) %in% 
        dataframe_stuff_that_needs_lookup_from_simulation[i,]]))
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)

You could also try `foreach` with `%dopar%`:

library(doParallel)  # will load parallel, foreach, and iterators
cl <- makeCluster(detectCores())
registerDoParallel(cl)
matrix_of_sums <- foreach(i = 1:NROW(dataframe_stuff_that_needs_lookup_from_simulation)) %dopar% {
    rowSums(simulation_results[,colnames(simulation_results) %in% 
    dataframe_stuff_that_needs_lookup_from_simulation[i,]])
}
stopCluster(cl)
ans <- Reduce("cbind", matrix_of_sums)

I wasn't quite sure what output you wanted at the end, but it looks like you're doing a cbind of each result, so that's what both versions do. Let me know if you're expecting something else.
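
Aside: with ~250,000 list elements, `do.call(cbind, ...)` binds everything in a single call, which is typically much faster than `Reduce("cbind", ...)`, since the latter copies the growing matrix at every step:

# Bind all column vectors at once instead of pairwise
ans <- do.call(cbind, matrix_of_sums)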

CPak
  • Hey, thanks for the response. When I used your code, I received the following error: Error in makePSOCKcluster(spec, ...) : numeric 'names' must be >= 1. I think this may be caused by the call to detectCores(), which is returning "NA" on my machine. I'm not sure why just yet. – Matthya C Oct 02 '17 at 20:22
  • That is strange. You can also specify the number of available cores manually, if you know. `makeCluster(8)` for instance. – CPak Oct 02 '17 at 20:25
  • I think we're getting close. It is detecting the number of nodes, but is unable to locate one of the original objects upon which I'm operating, simulation_results. That is strange, since I can see that it is full of data when I type its name into the console. Do I need to do anything else to pass that object to the cores? Here is the error that I'm getting: Error in checkForRemoteErrors(val) : 8 nodes produced errors; first error: object 'simulation_results' not found – Matthya C Oct 02 '17 at 20:42
  • Okay, I added the following line: clusterExport(cl=cl, varlist=c("simulation_results")), as recommended from here: https://stackoverflow.com/questions/12019638/using-parallels-parlapply-unable-to-access-variables-within-parallel-code I'm still getting errors, but it got me past the last step. – Matthya C Oct 02 '17 at 20:49
  • Yes, important to export your data *AND* packages (if you use any in the lapply or foreach loop). You can export packages with `clusterEvalQ(cl, { library(example)})` using base R parallel, and `foreach(..., .packages=c("example")...)` with `foreach` – CPak Oct 02 '17 at 21:03
  • This worked great. Cut my job from 24 minutes to 3'10. Thanks so much for your help! – Matthya C Oct 02 '17 at 23:45
  • Just FYI, you can also acknowledge contributions by accepting solutions that helped (check mark to the left). This also lets the community know the solution worked for you. If you accept another solution, that's okay with me...just wanted you to know. – CPak Oct 03 '17 at 13:48
  • Thanks for the heads up! I accepted this solution because it produced a slightly better timing result. – Matthya C Oct 03 '17 at 14:39

Without any applicable sample data to go off of, the process would look like this:

  • Create a holding matrix (matrix_of_sums)
  • Loop by row through the variable table (dataframe_stuff_that_needs_lookup_from_simulation)
  • Find the matching column indices within the simulation results (simulation_results)
  • Bind the rowSums into the holding matrix (matrix_of_sums)

I recreated a sample set; the data themselves are meaningless, but the process produces equivalent results and should work for your data:

# Hypothetical stand-ins: ts_df plays the role of
# dataframe_stuff_that_needs_lookup_from_simulation, and sim_df the role
# of simulation_results.
# Return the row sums from each worker rather than growing a matrix with
# `<<-`: assignments made inside forked mclapply workers do not propagate
# back to the parent process.
sums_list <- parallel::mclapply(1:nrow(ts_df), function(i){
    # Store the row to its own variable for ease
    d <- ts_df[i,]
    # Row sums over the simulation columns matching this row
    rowSums(sim_df[, which(colnames(sim_df) %in% colnames(d))])
}, mc.cores = parallel::detectCores(), mc.allow.recursive = TRUE)
# Holding matrix which is our end-goal
msums <- do.call(cbind, sums_list)
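
A note on the design: `mclapply` forks the R session, so side effects such as `<<-` assignments inside the callback stay in the child processes and never reach the parent; returning each vector of row sums and binding them once at the end keeps the result correct and avoids repeatedly copying a growing matrix.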
Carl Boneri
  • Hey Carl, I'm going to try your solution out next. Thanks for your response! – Matthya C Oct 02 '17 at 22:05
  • Carl, this version worked, too (I think - I received some core errors, but the job finished), but the timing was slightly worse than CPak's with 7 cores running. Your version cut my job from 24 minutes to 3'50". Thanks a lot for your help! – Matthya C Oct 02 '17 at 23:44
  • What environment are you running? For instance, I use RStudio Server (64-bit) on Ubuntu 16.04 with 32 cores. 7 cores sounds odd, which is why I ask. – Carl Boneri Oct 03 '17 at 00:07
  • It would also be helpful if we had the actual data, since there may be much faster implementations with data.table, etc. – Carl Boneri Oct 03 '17 at 00:07
  • I'm running Rstudio 1.0.153 on my home desktop with Ubuntu 16.04, an Intel i7-6700K with eight cores, and 32 GB of DDR4 RAM. Ultimately, I'd like to move over to AWS. The dataframe I reference in my original post consists of a number of float fields that are associated with string fields that need to be matched to the simulation matrix's columns for repeated lookups. The float fields in the dataframe are not important and could be removed; only the string fields and how they match to the lookup matrix column names are important. Any additional suggestions would be great! Thanks. – Matthya C Oct 03 '17 at 06:00
  • I used 7 cores in my run because several articles I read suggested using (total available cores - 1). – Matthya C Oct 03 '17 at 06:04