I'm trying to speed up a fuzzyjoin with parallel processing. I have two data frames, each with several thousand rows, which need to be joined by partial regex matching. However, it's currently taking over 40 minutes on a single core. The data frames look like:
require(fuzzyjoin)
df1 <- data.frame(first_last = c('Jackie S', 'James P', 'Jenny C', 'Jack N'),
                  age = sample(18:65, 4),
                  stringsAsFactors = F)
df2 <- data.frame(id = 1:6,
                  full_name = c('Jackie Smith, CPA',
                                'Joe Campbell III',
                                'James Park, MD',
                                'Joyce May, DDS',
                                'Jenny Cox',
                                'Jack Null Jr'),
                  stringsAsFactors = F)
merged <- regex_right_join(df2, df1, by = c('full_name' = 'first_last'))
(I'm using regex_right_join because regex_left_join wasn't working for me.)
To run it with parallel processing, I have tried:
require(doParallel)
require(foreach)
cl <- makeCluster(4)
registerDoParallel(cl)
parallel_merged <- foreach(i=1, .combine = rbind) %dopar%
fuzzyjoin::regex_right_join(df2, df1, by = c('full_name' = 'first_last'))
The user and system times are always very low when using doParallel and foreach (both under 1 second), but the elapsed time with foreach is always about the same as running on a single core (40+ minutes).
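For reference, the foreach loop above iterates only once (i = 1), so the entire join still runs inside a single worker. A sketch of what I expected parallelism to look like, splitting df1 into one chunk per worker (chunk count and cluster size of 4 are assumptions, and I haven't verified the timing):

```r
library(fuzzyjoin)
library(doParallel)
library(foreach)

cl <- makeCluster(4)
registerDoParallel(cl)

# Split df1 into 4 roughly equal row chunks, one per worker
chunks <- split(df1, cut(seq_len(nrow(df1)), 4, labels = FALSE))

# Each iteration joins df2 against one chunk of df1; rbind recombines
parallel_merged <- foreach(chunk = chunks,
                           .combine = rbind,
                           .packages = 'fuzzyjoin') %dopar%
  regex_right_join(df2, chunk, by = c('full_name' = 'first_last'))

stopCluster(cl)
```

Is something like this the right approach, or is there a better way to distribute the regex join?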