Partial string merge between 2 large datasets on R

Question

I have two dataframes, the 1st one contains about 900K observations and 2 columns :

ID	COMPANY
AD8.OSZ.23490	Company1
AD8.OSZ.18903	Company2
AD8.OSZ.90126	Company3

The second ones contains about 130k observations and also 2 columns, but the ID format is different (but not all observations are in the same form, for some there is no AD8.OSZ. for instance).

ID	Client_Since
desr-j50q02-AD8.OSZ.23490	1981
desr-j50q02-AD8.OSZ.18903	2003
desr-j50q02-AD8.OSZ.90126	2018

DESIRED OUTPUT

Full_ID	Client_Since	Company
desr-j50q02-AD8.OSZ.23490	1981	Company1
desr-j50q02-AD8.OSZ.18903	2003	Company2
desr-j50q02-AD8.OSZ.90126	2018	Company3

I tried 2 codes for my left join (i want to keep all the 130k obs) :

#1st 

library(fuzzyjoin)

df3 <- df %>% regex_left_join(df2, by = c(Full_ID = "ID"))

#2nd code

library(stringr)

df3 <- df %>% fuzzy_left_join(df2, by = c("Full_ID" = "ID"), match_fun = str_detect)

Error : memory vectors exhausted (limit reached ?)

I think that this code is too weak for the datasets i have / not appropriate for my MacbookAir. I did the manipulation found here : R on MacOS Error: vector memory exhausted (limit reached?) but it didn't change anything.

I read about "parallelizing" the use of R (https://datasquad.at.sites.carleton.edu/data/storage-design/dealing-with-a-vector-memory-exhausted-error-in-r/) but i really don't understand how to use mclapply with my join command.

I also looked at that topic : Partial string merge R large dataset but is not exactly the same case as me.

score 2 · Accepted Answer · answered Jul 19 '21 at 21:23

There is another way to think of it where from the sample you have shared you always seek IDs after the last period, hence you can create a new column with text after last period and join using it.

Below is example of how you can do that;

# Reading required libraries
library(dplyr)
library(stringr)

# Create sample dataframes
df1 <-
  data.frame(ID = c("AD8.OSZ.23490", "AD8.OSZ.18903", "AD8.OSZ.90126"),
             COMPANY = c("Company1", "Company2", "Company3"))

df2 <-
  data.frame(ID = c("desr-j50q02-AD8.OSZ.23490", "desr-j50q02-AD8.OSZ.18903", "desr-j50q02-AD8.OSZ.90126"),
             Client_Since = c("1981", "2003", "2018"))

# Modify first dataframe
mod_df1 <-
  df1 %>%
  # Get characters after last period
  mutate(MOD_ID = sub('.*\\.', '', ID))

# Modify second dataframe
mod_df2 <-
  df2 %>%
  # Get characters after last period
  mutate(MOD_ID = sub('.*\\.', '', ID))

# Join tables
mod_df1 %>%
  left_join(mod_df2, by = c("MOD_ID"))

Partial string merge between 2 large datasets on R

1 Answers1