I have a 1.3 million-row dataset of publications and, for each record, I want to retrieve a Paper_id from a second dataset with 8.6 million rows. The idea is to use multiple columns from both tables to find matches for dataset1 in dataset2, as shown in this functional but simplified script:
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2 %>% select(Title, Year, Publication_id, Paper_id),
                       by = list(x = c("Title", "Year", "Publication_id"),
                                 y = c("Title", "Year", "Publication_id")),
                       max_dist = 3, ignore_case = TRUE, distance_col = NULL)
I have two problems here. The first is that only "Title" has variations (typos, abbreviations, special characters, etc.) that would require fuzzy matching, yet the code allows variations in all three fields used for the join. This inflates the number of candidate matches with incorrect ones, as similar titles appear across different years and publications.
A solution I could think of for this first problem would be:
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
  stringdist_left_join(dataset2 %>%
                         select(Title2 = Title, Year2 = Year, Pub_id2 = Publication_id, Paper_id),
                       by = list(x = c("Title", "Year", "Publication_id"),
                                 y = c("Title2", "Year2", "Pub_id2")),
                       max_dist = 3, ignore_case = TRUE, distance_col = NULL) %>%
  filter(Year == Year2, Publication_id == Pub_id2)
This would solve the first problem, but then there is the second one: the script runs fine on sample data, but when I run it on the whole datasets it fails with the error "vector memory exhausted (limit reached?)".
So, my questions are:
Would it be possible to specify which columns must be identical and which should be matched fuzzily, which might also make the join lighter on memory?
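For instance, if I read the fuzzy_join() documentation correctly, something along these lines might express it. This is only a sketch; the match_fun list and the tolower() calls are my attempt to reproduce ignore_case = TRUE and max_dist = 3 from above:

library(fuzzyjoin); library(stringdist); library(tidyverse)
dataset1 %>%
  fuzzy_left_join(dataset2 %>% select(Title, Year, Publication_id, Paper_id),
                  by = c("Title", "Year", "Publication_id"),
                  match_fun = list(
                    # fuzzy comparison only for Title (threshold mirrors max_dist = 3)
                    function(x, y) stringdist(tolower(x), tolower(y), method = "osa") <= 3,
                    `==`,  # Year must be identical
                    `==`   # Publication_id must be identical
                  ))

I don't know whether this actually reduces the memory footprint, though, or whether fuzzy_left_join() still evaluates too many candidate pairs.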
Is there a way to subset both datasets according to the variables that must be identical, run the fuzzy match on the title within each subset, and then merge everything back? (I'm not sure whether this would work, as I would end up with 180 thousand subsets: 30 thousand publications across six years.)
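If that idea is workable, maybe it could even be done without materialising the 180 thousand subsets, for example as an exact join on Year and Publication_id followed by a string-distance filter on the titles. Again just a sketch (stringdist() comes from the stringdist package, and the threshold of 3 mirrors max_dist above); I'm not sure the intermediate join would fit in memory either:

library(stringdist); library(tidyverse)
candidates <- dataset1 %>%
  # exact join on the variables that must be identical (the "blocking" step)
  inner_join(dataset2 %>% select(Title, Year, Publication_id, Paper_id),
             by = c("Year", "Publication_id"),
             suffix = c("", "2")) %>%
  # fuzzy comparison of titles only within each Year / Publication_id block
  filter(stringdist(tolower(Title), tolower(Title2), method = "osa") <= 3)

# merge Paper_id back onto dataset1; rows with several candidate matches would be duplicated
dataset1 %>%
  left_join(candidates %>% select(Title, Year, Publication_id, Paper_id),
            by = c("Title", "Year", "Publication_id"))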
Any help is appreciated.
Best