
[Updated below]

I'd like to merge a large dataset (112 MB) with a smaller dataset (<1 MB) based on common names. The names are inexact matches between the two datasets. There are a number of tutorials on Stack Overflow for partial matching OR for managing large datasets, but not for both. R tends to freeze when standard methods of partial matching are applied to very large datasets. Below is some replicable data.

In the large dataset, names appear in all caps, last name first, with occasional suffixes, e.g.

JUDE, RICHARD J. MR.

In the smaller dataset, they are in the standard "First Name Last Name" format with no commas or suffixes. Each name has associated variables, such as how much money the person gave to a political candidate, or what company they work for.

df1$x <- c("JAYSHREE, JOHNSON D. JR.", "JAMESON, KATHERINE", "TOMMEND, LEONARD"),
df1$p <- c(100, 200, 300)

df2$y <- c("Leo Tommend", "Jay Johnson", "Kathy Jameson")
df2$c <- c("Apple", "Google", "Facebook")

Assume x has a few million rows and y has a few thousand. I've tried grepl, pmatch, and a specialized algorithm from another tutorial (here), but R hangs when I try those. I have loaded the larger data frame with data.table for speed.
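For context, the kind of per-name scan that hangs looks roughly like this (a sketch, not my exact code); for each of the few thousand names in y it rescans all few million rows of x, so the work grows as nrow(df1) * nrow(df2):

# A sketch of the naive approach that stalls at scale: grepl()
# rescans every row of df1 once per name in df2.
hits <- lapply(tolower(df2$y), function(nm) {
  last <- sub("^.* ", "", nm)                      # last word = last name
  which(grepl(last, tolower(df1$x), fixed = TRUE)) # matching row indices in df1
})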

I would err on the side of adding too many rows to the merged data frame if that helps. If there's anything I can do to make this question easier to answer, please let me know in the comments. Thank you for the help.

[Update]

Thanks to the commenters, I was able to reduce the number of matches to about 20,000, but that's still far too many. I've included a link to the two files. The two files are 1) every person in the U.S. who made a political donation in 2012 and 2) the names of every Internet founder.

https://www.dropbox.com/sh/x6tk1pujvfn0fnb/AACQyuICbJPR7VdDf3bbdIwwa?dl=0

When I applied @BondedDust's code, the number of matches shrank significantly! But there are still dozens of duplicate names. So, for instance, if "Aaron" founded a company, everyone named "Aaron" is added, and the new file assumes that 100 people founded the same company and that each "Aaron" gave to a different politician.

The goal is to match only the unique instances of each Internet founder with their political contributions; see the sketch below. I might need to add more data to the matching algorithm than just their names (possibilities include their location, but that's problematic because many Internet founders have multiple homes).
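One way I can imagine enforcing that (a sketch, reusing the xsub/ysub keys built in the answer below) is to keep only keys that occur exactly once in the merged result:

# A sketch: treat any key that yields more than one merged row as
# ambiguous and drop it, keeping only one-to-one matches.
m <- merge(df1, df2, by.x = "xsub", by.y = "ysub")
unique_keys <- names(which(table(m$xsub) == 1))
m_clean <- m[m$xsub %in% unique_keys, ]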

I hope this is helpful!

tom

1 Answer


Here's a shot: use the first three letters of the first name and the full last name as the matching key:

# Build a common key: first three letters of the first name, a space,
# then the full last name, all lower case.
df1$xsub <- gsub("^([^,]+), (.{3})(.+)", "\\2 \\1", tolower(df1$x))
df2$ysub <- gsub("^(.{3})([^ ]+) (.+)", "\\1 \\3", tolower(df2$y))
# (For three-letter first names like "Leo" and "Jay" the df2 pattern
# doesn't match, so the string passes through already in key form.)

merge(df1, df2, by.x = "xsub", by.y = "ysub")

#----------------
         xsub                  x   p             y        c
1 kat jameson JAMESON, KATHERINE 200 Kathy Jameson Facebook
2 leo tommend   TOMMEND, LEONARD 300   Leo Tommend    Apple
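Since the question mentions data.table, the same equi-join on the constructed keys can also be done as a keyed merge for speed. A minimal sketch, assuming the xsub/ysub columns built above:

# A keyed inner join in data.table on the constructed name keys.
library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
setkey(dt1, xsub)
setkey(dt2, ysub)
dt1[dt2, nomatch = 0]  # duplicate keys on both sides may need allow.cartesian = TRUE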

There's also a soundex function (soundex) described in the ?phonetics help page of pkg:RecordLinkage that might help with names that are spelled vaguely alike, e.g. Kathleen and Cathy, or John and Jon. I don't have any experience with shortening Soundex keys, but there might be opportunities to build on that work.
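A minimal sketch of that idea, assuming RecordLinkage's soundex() accepts a character vector (the xkey/ykey names are mine): key on the soundex code of the last name plus the first initial, so spelling variants can still collide:

# A sketch: phonetic key = soundex(last name) + first initial.
library(RecordLinkage)
df1$xkey <- paste(soundex(sub(",.*$", "", df1$x)),           # last name before the comma
                  substr(sub("^[^,]+, *", "", df1$x), 1, 1)) # first initial after it
df2$ykey <- paste(soundex(sub("^.* ", "", df2$y)),           # last word = last name
                  toupper(substr(df2$y, 1, 1)))              # first initial
merge(df1, df2, by.x = "xkey", by.y = "ykey")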

IRTFM
  • This is a really good start, but I tried it on my larger dataset and I got far too many duplicate names to handle. I should have just shared the original documents from the beginning; I'm going to add more detail in the edited version along with a link to the full file – tom Mar 11 '15 at 23:11
  • You said you would "err on the side of adding too many rows to the merged dataframe if that helps." You need to be more specific about your goals. You also need to offer any further data fields that might exist to disambiguate among the very common names. – IRTFM Mar 11 '15 at 23:51
  • You're right; I'm going to redo this on a separate thread and add more detail about the data and my goals. Thanks for your help so far – tom Mar 12 '15 at 18:52