I have two dataframes. The first contains about 900K observations and two columns:
ID | COMPANY |
---|---|
AD8.OSZ.23490 | Company1 |
AD8.OSZ.18903 | Company2 |
AD8.OSZ.90126 | Company3 |
The second contains about 130k observations and also two columns, but the ID format is different (and not all observations share the same form; for some, the AD8.OSZ. part is missing, for instance).
ID | Client_Since |
---|---|
desr-j50q02-AD8.OSZ.23490 | 1981 |
desr-j50q02-AD8.OSZ.18903 | 2003 |
desr-j50q02-AD8.OSZ.90126 | 2018 |
DESIRED OUTPUT
Full_ID | Client_Since | Company |
---|---|---|
desr-j50q02-AD8.OSZ.23490 | 1981 | Company1 |
desr-j50q02-AD8.OSZ.18903 | 2003 | Company2 |
desr-j50q02-AD8.OSZ.90126 | 2018 | Company3 |
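For reproducibility, here is a minimal version of the example tables above (the object names `df` and `df2` are assumptions; the real tables have 900K and 130k rows):

```r
# Toy versions of the two tables shown above (not the real 900K/130k data)
df <- data.frame(
  ID      = c("AD8.OSZ.23490", "AD8.OSZ.18903", "AD8.OSZ.90126"),
  COMPANY = c("Company1", "Company2", "Company3"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Full_ID = c("desr-j50q02-AD8.OSZ.23490",
              "desr-j50q02-AD8.OSZ.18903",
              "desr-j50q02-AD8.OSZ.90126"),
  Client_Since = c(1981, 2003, 2018),
  stringsAsFactors = FALSE
)
```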
I tried two approaches for my left join (I want to keep all 130k observations):
#1st attempt
library(dplyr)
library(fuzzyjoin)
df3 <- df %>% regex_left_join(df2, by = c(Full_ID = "ID"))
#2nd attempt
library(stringr)
df3 <- df %>% fuzzy_left_join(df2, by = c("Full_ID" = "ID"), match_fun = str_detect)
Both fail with:
Error: vector memory exhausted (limit reached?)
I think this approach is too heavy for the datasets I have / not appropriate for my MacBook Air. I applied the manipulation described here: R on MacOS Error: vector memory exhausted (limit reached?), but it didn't change anything.
I read about parallelizing R (https://datasquad.at.sites.carleton.edu/data/storage-design/dealing-with-a-vector-memory-exhausted-error-in-r/), but I really don't understand how to use mclapply with my join command.
I also looked at this topic: Partial string merge R large dataset, but it is not exactly the same case as mine.
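A memory-light alternative I am considering (a sketch in base R, assuming the short key can always be recovered from Full_ID with a fixed pattern such as "AD8.OSZ." followed by digits, and using toy data so it runs standalone): extract the key once with a regex, then do an exact left join, which avoids the 900K × 130k pairwise comparisons a fuzzy join performs:

```r
# Toy stand-ins for the real 900K / 130k tables (object names assumed)
df  <- data.frame(ID = c("AD8.OSZ.23490", "AD8.OSZ.18903"),
                  COMPANY = c("Company1", "Company2"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Full_ID = c("desr-j50q02-AD8.OSZ.23490", "no-key-here"),
                  Client_Since = c(1981, 1999),
                  stringsAsFactors = FALSE)

# Extract the short key once; rows without it get NA and simply stay unmatched
m <- regexpr("AD8\\.OSZ\\.[0-9]+", df2$Full_ID)
df2$ID <- NA_character_
df2$ID[m > 0] <- regmatches(df2$Full_ID, m)

# Exact left join keeping every row of the 130k table (merge() is base R's join)
df3 <- merge(df2, df, by = "ID", all.x = TRUE)
```

This is linear in the number of rows rather than quadratic in the pair count; dplyr's `left_join()` would work the same way once the `ID` column exists. Whether the fixed pattern covers the rows that lack the AD8.OSZ. prefix is something only the real data can confirm.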