I am relatively new to parallel processing, and though feeling "close" with my first application of it, I am running into a persistent problem with proper indexing.
The context
I have a character vector called ids of 11 values, corresponding to people's anonymized identification strings:
ids <- c("BC002", "EF001", "KK002", "KM004", "MH003", "TL001", "TTS123", "TTS54", "TTS91", "TTS94", "TTS95")
I have a function called scrapePerson_and_handle_error that uses tryCatch around another function of mine (scrapePerson) to import a given person/subdirectory's .json files, based on the corresponding element of ids (it finds/uses the matching file path), and bind them together. Key here is that this function works (i.e., I am pretty certain it's how I'm using foreach that's the problem, rather than scrapePerson_and_handle_error):
scrapePerson_and_handle_error <- function(id) {
  tryCatch(
    {
      # scrape and bind all of this person's .json files
      newdf <- scrapePerson(idTarget = id)
      newdf
    },
    error = function(e) {
      # report which ID failed and return NULL so it can be skipped when combining
      message("Errors with scraping data for ID: ", id)
      NULL
    }
  )
}
I also believe I have followed the appropriate setup steps for the packages I'm using, including determining the number of cores to use and registering the parallel backend:
num_cores <- parallel::detectCores()    # number of logical cores available
cl <- parallel::makeCluster(num_cores)  # spin up one worker per core
doParallel::registerDoParallel(cl)      # register the backend for %dopar%
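(For completeness: once the scraping finishes, the plan is to release the workers again.)
# shut the cluster down when the parallel work is done
parallel::stopCluster(cl)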
And ultimately, what I am attempting to do is scrape each person's data in parallel (i.e., divide the people to scrape across the available cores) and then bind their data frames together at the end.
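For reference, the serial equivalent of what I'm trying to parallelize would look roughly like this (just a sketch, to make the intended output clear):
# serial sketch: scrape each person one at a time, then stack the results
results_serial <- dplyr::bind_rows(lapply(ids, scrapePerson_and_handle_error))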
The troublesome code
results <- foreach::foreach(id = ids, .combine = dplyr::bind_rows) %dopar% {
  scrapePerson_and_handle_error(id)
}
When I run this, I get a data frame of 0 observations of 0 variables. Obviously not the desired outcome.
But when I tinker with the indexing (note: referring to ids directly now) and call the function directly, the following produces the desired output (i.e., a data frame of all of the first person's merged .json files):
scrapePerson_and_handle_error(ids[1])
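In case it helps narrow things down, my next step is to drop the .combine argument and explicitly ship my functions/packages to the workers, so I can inspect the raw per-worker output (just a sketch; the names in .packages are my guess at what scrapePerson depends on):
raw <- foreach::foreach(
  id = ids,
  .export   = c("scrapePerson", "scrapePerson_and_handle_error"),
  .packages = c("jsonlite", "dplyr")  # assumed dependencies of scrapePerson()
) %dopar% {
  scrapePerson_and_handle_error(id)
}
str(raw)  # a list of all NULLs would mean every worker hit the error branch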
I'd be very appreciative of any insights into how to correct my attempt to parallelize this process!