2

I'm trying to merge two large datasets. The common variable, first and last name, vary in spelling between the datasets and there are many duplicates, even between similarly spelled names. I've included download links for the files and some R code below. I'll walk through what I've tried and what went wrong.

There are a few R tutorials that have tried to tackle (the common) problem of record linking, but none of dealt with large datasets. I'm hoping the SO community can help me solve this problem.

--Attempted code matching with regular expressions--

My first attempt, thanks to the help of previous SO suggestions, was to use agrep and regular string matching. This narrowed down the names, but resulted in too many duplicates

#Load files#

expends12 <- fread("file path for FEC", sep="|", header=FALSE)
crunchbase.raw <- fread("file path for internet founders")

exp <- expends12
cr <- crunchbase.raw

#user regular string matching#
exp$xsub= gsub("^([^,]+)\\, (.{7})(.+)", "\\2 \\1", tolower(expends12$V8))
cr$ysub= gsub("^(.{7})([^ ]+) (.+)", "\\1 \\3", tolower(cr$name))

#merge files#
fec.merge <- merge(exp,cr, by.x="xsub", by.y="ysub")

The result is 6,900 rows, so there are (a lot) of duplicates. Many rows are people with similar names as Internet founders, such as Alexander Black, but are from different states and have different job titles. So, now its a question of finding the real Internet founder.

One option to narrow the results would be filter the results by states. So, I might only take the Alexander Black from California or New York, because that is where most startups are founded. I might also only take certain job titles, such as CEO or founder. But, many founders had jobs before and after their companies, so i wouldn't want to narrow by job title too much.

Alternatively, there is an r package, RecordLinkage, but as I far as I can tell, there needs to be similar rows and columns between the datasets, which is a nonstarter for this task

I'm familiar with R, but have somewhat limited statistical knowledge and programming ability. Any step-by-step help is very much appreciated. Thank you and please let me know if there's any trouble downloading the data.

tom
  • 977
  • 3
  • 14
  • 30
  • I don't think that you can have an answer to this question as it is asked. People generally avoid downloading a complete real data-set( time, size,security,..). Better to divide your big problem in sub-problems ( duplicated matching pattern) and create a small data set that expose it. – agstudy Mar 27 '15 at 21:33
  • Thanks. I tried that before with simple mock data. The advice ended up not working because when I tried it on the larger dataset, there were too many duplicates. If there's another way, i'm happy to try it! – tom Mar 27 '15 at 22:13

1 Answers1

0

Why don't you select the columns you need from both datasets, rename them similarly and in the result object, you get the row indices for matches returned. As long as you don't reorder things, you can use the results to match both datasets.