I have a function that needs to manipulate three data frames, all with different structure:

  • a: Original data frame. It is a parameter for my function. I need to remove rows from here, given certain conditions.
  • b: New data frame created in my function. My function adds all the rows here.
  • c: Another new data frame created in my function. My function adds all the rows here.

To try out parallel processing, I set up a minimal example (following this question and this blog) in which I only generate b:

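library(doParallel)  # also attaches foreach and parallel

# Set up the parallel backend; keep the cluster handle so it can be stopped
cl <- makeCluster(3L)
registerDoParallel(cl)

b <- foreach(i = seq_len(nrow(f)), .combine = rbind) %dopar% {
  tempB <- do_something_function()

  tempB
}

stopCluster(cl)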

That example works perfectly, but it leaves out the other two data frames. I found other answers, but I believe my case is different.

I could change a into a data frame of the rows that will later be removed, but each output must then be combined only with its own kind: every tempA with the other tempA rows, every tempB with the other tempB rows, and so on. The answers to the questions I linked mix all of the outputs together.

Carrol

2 Answers

It seems that your problem is not about parallelism but about combining the results.

Here is how I would do it (which I think is the most efficient way): return a list from each iteration, then bind the results by position afterwards.

library(foreach)
tmp <- foreach(i = seq_len(32)) %do% {
  list(iris[i, ], mtcars[i, ], iris[i, ])
}

lapply(purrr::transpose(tmp), function(l) do.call(rbind, l))
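
To see what the transpose step does, here is a minimal base-R sketch without purrr, assuming each iteration returns an unnamed list of three one-row data frames as above. The helper name bind_by_position is hypothetical and is not part of foreach or purrr; a plain lapply stands in for the foreach result.

```r
# Stand-in for the foreach() result: one element per iteration,
# each element being a list of three one-row data frames.
tmp <- lapply(seq_len(32), function(i) {
  list(iris[i, ], mtcars[i, ], iris[i, ])
})

# Hypothetical helper: for each position k, pull the k-th data frame out of
# every iteration and row-bind them. This mirrors transpose() + rbind.
bind_by_position <- function(results) {
  lapply(seq_along(results[[1]]), function(k) {
    do.call(rbind, lapply(results, `[[`, k))
  })
}

out <- bind_by_position(tmp)
dim(out[[1]])  # all iris rows collected from position 1
dim(out[[2]])  # all mtcars rows collected from position 2
```

Each position is bound independently, so frames with different structures never get mixed.
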
F. Privé

This is the solution I have found so far. Instead of removing rows from a inside the loop, I build a data frame of the rows that will be deleted. I wrote a combine function that binds the results position by position:

combine <- function(x, ...) {  
  mapply(rbind, x, ..., SIMPLIFY = FALSE)
}
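
As a quick sanity check of what this function does, here is a small sketch that calls it by hand on two made-up iteration results. The data frames and their column names (id, x, y, flag) are invented for illustration only.

```r
# Same combine function as above.
combine <- function(x, ...) {
  mapply(rbind, x, ..., SIMPLIFY = FALSE)
}

# Made-up results of two iterations; each is list(tempA, tempB, tempC),
# with a different structure at each position.
iter1 <- list(data.frame(id = 1L), data.frame(x = 0.5, y = "u"), data.frame(flag = TRUE))
iter2 <- list(data.frame(id = 2L), data.frame(x = 1.5, y = "v"), data.frame(flag = FALSE))

# mapply() walks the lists in parallel: rbind(tempA1, tempA2),
# rbind(tempB1, tempB2), rbind(tempC1, tempC2).
merged <- combine(iter1, iter2)

merged[[1]]$id     # ids from both iterations, in order
nrow(merged[[2]])  # one row per iteration
```

Because the function accepts any number of arguments through `...`, it also works when foreach hands it more than two results at once.
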

And my loop is something like this:

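library(doParallel)  # also attaches foreach and parallel

# Set up the parallel backend; keep the cluster handle so it can be stopped
cl <- makeCluster(3L)
registerDoParallel(cl)

# Loop: each iteration returns its three frames; combine() binds them by position
output <- foreach(i = seq_len(nrow(f)), .combine = combine, .multicombine = TRUE) %dopar% {
  tempA <- get_this_value()
  tempB <- do_something_function()
  tempC <- get_this_other_frame()

  # Return the values
  list(tempA, tempB, tempC)
}

stopCluster(cl)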

Then I access the data using output[[1]] and so on. However, with this solution I still have to run a setdiff or anti_join after the loop to remove the "undesired" rows from a.
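
That last step could look roughly like the sketch below. It assumes a has a unique key column, here called id, which is an invented name; the sample data are made up, and to_remove stands in for output[[1]].

```r
# Made-up original frame `a`; the `id` key column is an assumption.
a <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

# Stand-in for output[[1]]: the rows flagged for deletion inside the loop.
to_remove <- data.frame(id = c(2L, 4L), value = c(20, 40))

# Keep only the rows of `a` whose key does not appear in `to_remove`.
# With dplyr, anti_join(a, to_remove, by = "id") expresses the same filter.
a_clean <- a[!(a$id %in% to_remove$id), , drop = FALSE]

a_clean$id  # keys of the rows that remain
```

If a has no single key column, the comparison would have to use every column that identifies a row.
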

Carrol