I'm trying to speed up a fuzzyjoin with parallel processing. I have two dataframes, each with several thousand rows, which need to be joined on a partial regex match. However, it's currently taking over 40 minutes on a single core. The dataframes look like:

require(fuzzyjoin)

df1 <- data.frame(first_last = c('Jackie S', 'James P', 'Jenny C', 'Jack N'),
                  age = sample(18:65, 4), 
                  stringsAsFactors = F)
df2 <- data.frame(id = c(1:6), 
                  full_name = c('Jackie Smith, CPA', 
                                'Joe Campbell III',
                                'James Park, MD', 
                                'Joyce May, DDS',
                                'Jenny Cox',
                                'Jack Null Jr'), 
                  stringsAsFactors = F)

merged <- regex_right_join(df2, df1, by = c('full_name' = 'first_last'))

(I'm using regex_right_join because regex_left_join isn't working).

To run it with parallel processing, I have tried:

require(doParallel)
require(foreach)

cl <- makeCluster(4)
registerDoParallel(cl)

parallel_merged <- foreach(i=1, .combine = rbind) %dopar%
  fuzzyjoin::regex_right_join(df2, df1, by = c('full_name' = 'first_last'))

The user and system time are always very low when using doParallel and foreach. Both are < 1s. However the elapsed time with foreach is always about the same as running on a single core (40+ minutes).

Highland
  • There's only one call to `fuzzyjoin::regex_right_join()` when you use `foreach(i=1, ...)` - all you get is running that code in one of the four background R workers. So, yes, same processing time as if running in the master process + overhead of communicating with the background worker. – HenrikB Nov 16 '17 at 21:10
  • PS. Don't use `require()` - always use `library()`. – HenrikB Nov 16 '17 at 21:10
  • @HenrikB If I do anything other than `i=1` like `i=1:nrow(df)` then the merged dataframe is duplicated by the number of rows and ends up running longer. – Highland Nov 16 '17 at 21:52
  • @Highland `i` is the parameter that is parallelized. If `i` is only one integer, then you can't parallelize that, clearly. – thc Nov 16 '17 at 22:22
  • Also, what is your intended behavior if there is more than one match? E.g., if there is a Jenny Chan as well as a Jenny Cox? – thc Nov 16 '17 at 22:26
  • @thc With the available data, just match one-to-many where there is more than one match. – Highland Nov 17 '17 at 13:58
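
To make the point from the comments concrete: `foreach` only spreads work across workers when it iterates over more than one piece, and each iteration has to process a different slice of the data (otherwise every worker repeats the full join and `rbind` stacks the duplicates). Below is a minimal sketch of that idea, not a tested solution for the real data. It assumes it is acceptable to split `df1` (the side holding the regex patterns) by row; the names `n_workers` and `chunks` are made up for illustration. Because a right join keeps every row of `df1` together with all of its matches in `df2`, joining each slice separately and binding the results should give the same rows as the single call, including one-to-many matches, though possibly in a different order.

```r
library(fuzzyjoin)
library(foreach)
library(doParallel)

df1 <- data.frame(first_last = c('Jackie S', 'James P', 'Jenny C', 'Jack N'),
                  age = sample(18:65, 4),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:6,
                  full_name = c('Jackie Smith, CPA',
                                'Joe Campbell III',
                                'James Park, MD',
                                'Joyce May, DDS',
                                'Jenny Cox',
                                'Jack Null Jr'),
                  stringsAsFactors = FALSE)

# Hypothetical worker count; adjust to the machine.
n_workers <- 2
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# Assign each row of df1 to one of n_workers groups (round-robin),
# giving a list of row-index vectors, one per chunk.
chunks <- split(seq_len(nrow(df1)),
                rep(seq_len(n_workers), length.out = nrow(df1)))

# Each iteration joins all of df2 against one slice of df1.
parallel_merged <- foreach(idx = chunks,
                           .combine = rbind,
                           .packages = "fuzzyjoin") %dopar% {
  regex_right_join(df2, df1[idx, , drop = FALSE],
                   by = c('full_name' = 'first_last'))
}

stopCluster(cl)
```

Whether this is actually faster depends on the data: every worker still needs a full copy of `df2`, so the communication overhead may outweigh the gain for small inputs like the example above.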
