0

I have two dataframes, the 1st one contains about 900K observations and 2 columns :

ID COMPANY
AD8.OSZ.23490 Company1
AD8.OSZ.18903 Company2
AD8.OSZ.90126 Company3

The second ones contains about 130k observations and also 2 columns, but the ID format is different (but not all observations are in the same form, for some there is no AD8.OSZ. for instance).

ID Client_Since
desr-j50q02-AD8.OSZ.23490 1981
desr-j50q02-AD8.OSZ.18903 2003
desr-j50q02-AD8.OSZ.90126 2018

DESIRED OUTPUT

Full_ID Client_Since Company
desr-j50q02-AD8.OSZ.23490 1981 Company1
desr-j50q02-AD8.OSZ.18903 2003 Company2
desr-j50q02-AD8.OSZ.90126 2018 Company3

I tried 2 codes for my left join (i want to keep all the 130k obs) :

#1st 

library(fuzzyjoin)

df3 <- df %>% regex_left_join(df2, by = c(Full_ID = "ID"))

#2nd code

library(stringr)

df3 <- df %>% fuzzy_left_join(df2, by = c("Full_ID" = "ID"), match_fun = str_detect)

Error : memory vectors exhausted (limit reached ?)

I think that this code is too weak for the datasets i have / not appropriate for my MacbookAir. I did the manipulation found here : R on MacOS Error: vector memory exhausted (limit reached?) but it didn't change anything.

I read about "parallelizing" the use of R (https://datasquad.at.sites.carleton.edu/data/storage-design/dealing-with-a-vector-memory-exhausted-error-in-r/) but i really don't understand how to use mclapply with my join command.

I also looked at that topic : Partial string merge R large dataset but is not exactly the same case as me.

katdataecon
  • 185
  • 8

1 Answers1

2

There is another way to think of it where from the sample you have shared you always seek IDs after the last period, hence you can create a new column with text after last period and join using it.

Below is example of how you can do that;

# Reading required libraries
library(dplyr)
library(stringr)

# Create sample dataframes
df1 <-
  data.frame(ID = c("AD8.OSZ.23490", "AD8.OSZ.18903", "AD8.OSZ.90126"),
             COMPANY = c("Company1", "Company2", "Company3"))

df2 <-
  data.frame(ID = c("desr-j50q02-AD8.OSZ.23490", "desr-j50q02-AD8.OSZ.18903", "desr-j50q02-AD8.OSZ.90126"),
             Client_Since = c("1981", "2003", "2018"))

# Modify first dataframe
mod_df1 <-
  df1 %>%
  # Get characters after last period
  mutate(MOD_ID = sub('.*\\.', '', ID))

# Modify second dataframe
mod_df2 <-
  df2 %>%
  # Get characters after last period
  mutate(MOD_ID = sub('.*\\.', '', ID))

# Join tables
mod_df1 %>%
  left_join(mod_df2, by = c("MOD_ID"))
Nareman Darwish
  • 1,251
  • 7
  • 14