[Updated below]
I'd like to merge a large dataset (112 MB) with a smaller dataset (<1 MB) based on common names. The names are inexact matches between the two datasets. There are a number of tutorials on Stack Overflow for partial matching OR managing large datasets, but not for both. R tends to freeze when standard methods of partial matching are applied to very large datasets. Below is some reproducible data.
In the large dataset, names appear in all caps, last name first, with occasional suffixes, e.g.:
JUDE, RICHARD J. MR.
In the smaller dataset, they are in the standard "First Name Last Name" format with no commas or suffixes. Each name has associated variables, such as how much money they gave to a political candidate, or what company they work for.
df1 <- data.frame(x = c("JAYSHREE, JOHNSON D. JR.", "JAMESON, KATHERINE", "TOMMEND, LEONARD"), stringsAsFactors = FALSE)
df1$p <- c(100, 200, 300)
df2 <- data.frame(y = c("Leo Tommend", "Jay Johnson", "Kathy Jameson"), stringsAsFactors = FALSE)
df2$c <- c("Apple", "Google", "Facebook")
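To make the two name formats comparable, one option is to reduce both to lowercase last-name and first-name tokens before any matching. The following is only a minimal base R sketch; the helper names parse_last_first and parse_first_last are hypothetical, and it assumes the large file always uses "LAST, FIRST ..." and the small file always uses "First Last".

```r
# Hypothetical helper: split "LAST, FIRST M. SUFFIX" into lowercase tokens.
parse_last_first <- function(x) {
  last  <- tolower(trimws(sub(",.*$", "", x)))   # text before the comma
  rest  <- trimws(sub("^[^,]*,", "", x))         # text after the comma
  first <- tolower(sub("[ .].*$", "", rest))     # first token of the remainder
  data.frame(last = last, first = first, stringsAsFactors = FALSE)
}

# Hypothetical helper: split "First Last" into the same two lowercase tokens.
parse_first_last <- function(y) {
  y <- trimws(y)
  first <- tolower(sub(" .*$", "", y))           # first word
  last  <- tolower(sub("^.* ", "", y))           # last word
  data.frame(last = last, first = first, stringsAsFactors = FALSE)
}

big   <- parse_last_first(c("JAYSHREE, JOHNSON D. JR.", "JAMESON, KATHERINE", "TOMMEND, LEONARD"))
small <- parse_first_last(c("Leo Tommend", "Jay Johnson", "Kathy Jameson"))
```

Middle initials and suffixes such as "D." and "JR." are simply dropped here. Names with more than two parts in the small file (for example a middle name) would need extra handling.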
Assume x has a few million rows and y has a few thousand. I've tried grepl, pmatch, and a specialized algorithm from another tutorial (here), but R hangs when I try those. I have loaded the large dataset (df1) with data.table for speed.
I would err on the side of adding too many rows to the merged data frame if that helps. If there's anything I can do to make this question easier to answer, please let me know in the comments. Thank you for the help.
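One way to keep R from hanging is to avoid comparing every row with every row: join exactly on one cheap key first ("blocking"), and only apply the loose comparison inside the resulting candidate set. The sketch below is an illustration under assumptions, not a tested solution for the real files: it assumes lowercase last and first columns have already been extracted, uses toy data (including an invented extra donor row), and uses base merge rather than data.table so that it runs without extra packages.

```r
# Toy data; the last/first keys are assumed to have been extracted beforehand.
donors <- data.frame(
  last  = c("jayshree", "jameson", "tommend", "tommend"),
  first = c("johnson", "katherine", "leonard", "mary"),
  p     = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)
founders <- data.frame(
  last  = c("tommend", "johnson", "jameson"),
  first = c("leo", "jay", "kathy"),
  c     = c("Apple", "Google", "Facebook"),
  stringsAsFactors = FALSE
)

# Step 1 (blocking): exact join on last name only, so the loose comparison
# runs on a small candidate set rather than on every pair of rows.
cand <- merge(donors, founders, by = "last", suffixes = c(".donor", ".founder"))

# Step 2 (loose filter): keep pairs whose first names share a first initial.
# This deliberately errs on the side of keeping too many rows.
same_initial <- substr(cand$first.donor, 1, 1) == substr(cand$first.founder, 1, 1)
cand <- cand[same_initial, ]
```

With this rule "Leo Tommend" pairs with "TOMMEND, LEONARD" and "Kathy Jameson" with "JAMESON, KATHERINE". "Jay Johnson" is not linked to "JAYSHREE, JOHNSON D. JR.", because the last names differ once the comma format is taken literally; whether that pair should match is not clear from the sample. A stricter second step could replace the first-initial test with an edit-distance comparison such as adist.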
[Update]
Thanks to the commenters, I was able to reduce the number of matches to about 20,000, but that's still far too many. I've included a link to the two files: 1) every person in the U.S. who made a political donation in 2012, and 2) the names of every Internet founder.
https://www.dropbox.com/sh/x6tk1pujvfn0fnb/AACQyuICbJPR7VdDf3bbdIwwa?dl=0
When I applied @BondedDust's code, the number of matches shrank significantly! But there are still dozens of duplicate names. So, for instance, if "Aaron" founded a company, everyone named "Aaron" is added, and the new file assumes that 100 people founded the same company and that each "Aaron" gave to a different politician.
The goal is to match only the unique instances of each Internet founder with their political contributions. I might need to add more data to the matching algorithm than just their names (possibilities include their location, but that's problematic because many Internet founders have multiple homes).
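As a rough illustration of "unique instances", one heuristic is to keep only founders whose candidate matches all point to a single donor spelling, and set the rest aside for review or for matching on extra fields. The data below is invented for the example, and the column names founder, donor and p are assumptions.

```r
# Toy candidate matches: one row per (founder, donation) pair.
matches <- data.frame(
  founder = c("Aaron Smith", "Aaron Smith", "Aaron Smith", "Leo Tommend", "Leo Tommend"),
  donor   = c("SMITH, AARON", "SMITH, AARON J.", "SMITH, AARON",
              "TOMMEND, LEONARD", "TOMMEND, LEONARD"),
  p       = c(50, 75, 20, 300, 100),
  stringsAsFactors = FALSE
)

# Count distinct donor spellings per founder.
n_donors <- tapply(matches$donor, matches$founder, function(d) length(unique(d)))

# Founders linked to exactly one donor spelling are treated as unambiguous.
unambiguous <- names(n_donors)[n_donors == 1]
clean  <- matches[matches$founder %in% unambiguous, ]   # keep as matched
review <- matches[!matches$founder %in% unambiguous, ]  # needs more information
```

This is only a heuristic: one donor can appear under several spellings, and two different people can share an identical spelling, so a field such as state or employer would still be needed to separate them reliably.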
I hope this is helpful!