
I am relatively new to parallel processing, and though I feel "close" with my first application of it, I am running into a persistent problem with proper indexing.

The context

I have a character vector called ids of 11 values, corresponding to people's anonymized identification strings:

ids <- c("BC002", "EF001", "KK002", "KM004", "MH003", "TL001", "TTS123", "TTS54", "TTS91", "TTS94", "TTS95")

I have a defined function called scrapePerson_and_handle_error that wraps another function of mine (scrapePerson) in a tryCatch. For a given value in ids, it finds and uses the corresponding file path, imports that person's/subdirectory's .json files, and binds them together. Key here is that this function works (i.e., I am pretty certain the problem is in how I'm using foreach, rather than in scrapePerson_and_handle_error):

scrapePerson_and_handle_error <- function(id) {
  tryCatch(
    {
      newdf <- scrapePerson(idTarget = id)
      newdf
    },
    error = function(e) {
      message("Errors with scraping data for ID: ", id)
      NULL
    }
  )
}

I also believe I have followed the appropriate steps for parallel processing with the packages I'm using, including determining the number of cores to use and registering the parallel backend:

num_cores <- parallel::detectCores()

cl <- parallel::makeCluster(num_cores)
doParallel::registerDoParallel(cl)
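
For completeness, the matching cleanup step once scraping finishes would be the standard teardown:

# Release the worker processes when the parallel work is done
parallel::stopCluster(cl)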

What I am ultimately attempting to do is scrape each person's data in parallel (i.e., divide the people to scrape across the available cores) and then bind their data frames together at the end.

The troublesome code

results <- foreach::foreach(id = ids, .combine = dplyr::bind_rows) %dopar% {
  scrapePerson_and_handle_error(id)
}

When I run this, I get a data frame of 0 observations of 0 variables. Obviously not the desired outcome.

But when I tinker with the indexing (note: now referring to ids directly) and run the function directly, I find that the following produces the desired output (i.e., a data frame of all of the first person's merged .json files):

scrapePerson_and_handle_error(ids[1])

I'd be very appreciative of any insights for correcting my attempt to parallelize this process!

jsakaluk
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. It's really not easy to help if we have no idea what's really in these variables. What happens if you use `%do%` rather than `%dopar%`? Same result? – MrFlick Aug 17 '23 at 21:02
  • I totally understand. I've provided a bit more detail re: my variable ids and the function I'm trying to use iteratively, and the mainstream packages/functions are already depicted. Unfortunately, given the kinds of data I'm dealing with + using a boutique function in the defined function makes a fully reproducible example difficult to generate. To answer your follow up: %do% generates the appropriate output: a df of 25028 obs of 11 variables. – jsakaluk Aug 17 '23 at 21:14
  • A `message` is useless in parallel processing, you won't even see it. Have your error catcher return `e` and remove `.combine = dplyr::bind_rows`, so you can see the actual error message. I don't know the `scrapePerson` function but it looks like it might rely on objects from the global environment, which are not exported to the workers. You would fix this by passing these objects as function arguments or by telling `foreach` explicitly to export these objects to the workers. – Roland Aug 18 '23 at 05:57
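
For readers following along, here is a minimal sketch of the diagnostic Roland describes: have the error handler return the condition object, and drop the combiner so worker-side errors surface. The .export and .packages values below are assumptions about what scrapePerson needs on the workers; substitute whatever it actually uses.

# Sketch: surface worker-side errors instead of swallowing them
debug_scrape <- function(id) {
  tryCatch(
    scrapePerson(idTarget = id),
    error = function(e) e  # return the condition so it survives the trip back
  )
}

results <- foreach::foreach(
  id = ids,
  .export = "scrapePerson",           # helper isn't visible on workers by default
  .packages = c("jsonlite", "dplyr")  # assumed dependencies of scrapePerson
) %dopar% {
  debug_scrape(id)
}

# Inspect which IDs errored, and why
errs <- Filter(function(x) inherits(x, "error"), results)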

1 Answer


If you're willing to try a different package for parallelization, I've become quite fond of future.apply. It largely mirrors the behavior of the hopefully familiar apply* functions.

Something like this would output a list of results from your function, spitting out any error messages when done.

library(future)
library(future.apply)

plan("multisession")

out <- future_lapply(
    X = ids,
    FUN = scrapePerson_and_handle_error,
    future.seed = NULL  # prevent the random-seed warning (no RNG used here)
)
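
From there, combining the results and spotting failures might look like this (assuming your error handler still returns NULL on failure; bind_rows() silently skips NULL entries):

# Collapse the per-person results into one data frame
results <- dplyr::bind_rows(out)

# Which IDs failed?
failed <- ids[vapply(out, is.null, logical(1))]

# Shut the workers down when finished
plan("sequential")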
postitman