Fixing string variables with varying spellings, etc

Question

I have a dataset with individuals names, addresses, phone numbers, etc. Some individuals appear multiple times, with slightly varying names/ and/or addressees and/or phone numbers. A snippet of the fake data is shown below:

first       last    address         phone    
Jimmy       Bamboo  P.O. Box 1190   xxx-xx-xx00
Jimmy W.    Bamboo  P.O. Box 1190   xxx-xx-xx22
James West  Bamboo  P.O. Box 219    xxx-66-xxxx

... and so on. Some times E. is spelled out as east, St. as Street, at other times they are not.

What I need to do is run through almost 120,000 rows of data to identify each unique individual based on their names, addresses, and phone numbers. Anyone have a clue as to how this might be done without manually running through each record, one at a time? The more I stare at it the more I think its impossible without making some judgment calls and saying if at least two or three fields are the same treat this as a single individual.

thanks!!

Ani

This sounds like a job for `regular expressions`. Won't be easy with regular expressions alone though. (For example, you could easily match anything that's like JxxxxxxBamboo. But it will then also match Jacintha Bamboo.) So you might have to find a list of common nicknames. If you look into R's `text mining` tools, maybe there are functions that handle this kind of stuff automagically? — Tommy O'Dell, Feb 28 '13 at 00:08
The hard route, trying to fix at least the names so that they are consistent across records. — Ani, Feb 28 '13 at 00:09
I'd start with finding entries that match exactly on a particular field, and then checking if the other fields are approximately the same. See how much you can whittle the data set down using exact matches before you start worrying about fuzzy matches. — Marius, Feb 28 '13 at 00:09
@Ani you show 3 observations. I don't know if you consider that the 3 are similar or not. Put some effort to create a representative cases of your datas it is half way the solution. Whatever, I think it is a job of `agrep` here. But I need more examples of and similar/not similar output. — agstudy, Feb 28 '13 at 00:12
This is not a trivial task. There are so many things to consider. For example, a person with a hispanic name of first name Vivian, last name Hernandez-Montoya may very well be entered as first name Bibianne Ernandes, last name Montoya. I work with this stuff all the time, and you would not believe the ways data entry people can mangle names and birthdates. But they expect the software to magically find things that sometimes don't even look the same. Just standardizing addresses by itself is not trivial. — hatchet - done with SOverflow, Feb 28 '13 at 00:14
Thanks everyone; I'll try and work through some of the leads re: levenshtein, text mining tools, etc. and if that doesn't work I'll try to find matches on 2 or more fields and report back. thanks again! — Ani, Feb 28 '13 at 00:34

score 3 · Accepted Answer · edited May 23 '17 at 10:34

As I mentioned in the comments, this is not trivial. You have to decide the trade-off of programmer time/solution complexity with results. You will not achieve 100% results. You can only approach it, and the time and complexity cost will increase the closer to 100% you get. Start with an easy solution (exact matches), and see what issue most commonly causes the missed matches. Implement a fuzzy solution to address that. Rinse and repeat.

There are several tools you can use (we use them all).

1) distance matching, like Damerau Levenshtein . you can use this for names, addresses and other things. It handles error like transpositions, minor spelling, omitted characters, etc.

2) phonetic word matching - soundex is not good. There are other more advanced ones. We ended up writing our own to handle the mix of ethnicities we commonly encounter.

3) nickname lookups - many nicknames will not get caught by either phonetic or distance matching - names like Fanny for Frances. There are many nicknames like that. You can build a lookup of nicknames to regular name. Consider though the variations like Jennifer -> Jen, Jenny, Jennie, Jenee, etc.

Names can be tough. Creative spelling of names seems to be a current fad. For instance, our database has over 30 spelling variations of the name Kaitlynn, and they are all spellings of actual names. This makes nickname matching tough when you're trying to match Katy to any of those.

Here are some other answers on similar topics I've made here on stackoverflow:

Processing of mongolian names

How to solve Dilemma of storing human names in MySQL and keep both discriminability and a search for similar names?

MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard

score 0 · Answer 2 · answered Feb 28 '13 at 00:02

0

You can calculate the pairwise matrix of Levenshtein distances.

See this recent post for more info: http://www.markvanderloo.eu/yaRb/2013/02/26/the-stringdist-package/

answered Feb 28 '13 at 00:02

Ari B. Friedman

71,271
35
175
235

Fixing string variables with varying spellings, etc

2 Answers2