
I'm attempting to clean up a database that, over the years, had acquired many duplicate records, with slightly different names. For example, in the companies table, there are names like "Some Company Limited" and "SOME COMPANY LTD!".

My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters and then use agrep to see what looks similar.
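For concreteness, here is a minimal sketch of that pipeline in base R (the example names and the synonym list are made up for illustration):

```r
# Hypothetical example data; in practice this would come from the exported table.
companies <- c("Some Company Limited", "SOME COMPANY LTD!", "Another Firm Inc.")

# Normalise: lower case, replace common synonyms, strip non-alphabetic characters.
clean <- tolower(companies)
clean <- gsub("\\blimited\\b", "ltd", clean)   # synonym replacement
clean <- gsub("[^a-z ]", "", clean)            # keep letters and spaces only
clean <- trimws(gsub(" +", " ", clean))        # collapse repeated spaces

# Naive all-pairs fuzzy matching with agrep, one pattern at a time.
matches <- lapply(seq_along(clean), function(i) {
  hits <- agrep(clean[i], clean[-i], max.distance = 0.2, value = TRUE)
  if (length(hits)) data.frame(name = clean[i], candidate = hits) else NULL
})
do.call(rbind, matches)
```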

My first problem is that agrep only accepts a single pattern to match, and looping over every company name to match against the others is slow. (Some tables to be cleaned will have tens, possibly hundreds of thousands of names to check.)

I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text, rather than just names.

I have a few related questions:

  1. Is the tm package appropriate for this sort of task?

  2. Is there a faster alternative to agrep? (Said function uses the Levenshtein edit distance, which is anecdotally slow.)

  3. Are there other suitable tools in R, apart from agrep and tm?

  4. Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)

Richie Cotton
  • Related to [How to measure similarity between strings?](http://stackoverflow.com/questions/6044112/r-how-to-measure-similarity-between-strings) – Joshua Ulrich Jul 13 '11 at 18:03

4 Answers


If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering.
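As a rough sketch of how that can look (the example data frame, the `first3` blocking column, and the 0.7 weight cut-off are all invented for illustration):

```r
library(RecordLinkage)

# Hypothetical data frame of already-cleaned company names, plus a helper
# column holding the first three letters to block on.
companies <- data.frame(
  name   = c("some company ltd", "some company ltd", "another firm inc"),
  first3 = c("som", "som", "ano"),
  stringsAsFactors = FALSE
)

# Deduplicate within one table: blocking on the first three letters means only
# pairs that share that prefix are compared, and strcmp = TRUE applies a fuzzy
# string comparator (Jaro-Winkler by default) instead of exact equality.
pairs <- compare.dedup(companies, blockfld = 2, strcmp = TRUE)

# Weight the candidate pairs and pull out the likely duplicates.
weighted <- epiWeights(pairs)
getPairs(weighted, min.weight = 0.7, single.rows = TRUE)
```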

I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own function that uses my own weighting scheme. (Also, as it stands, you can't use soundex() on big data sets with RecordLinkage.)
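A minimal sketch of that kind of hand-rolled comparator (the weights below are arbitrary, purely for illustration):

```r
library(RecordLinkage)

# Combine several RecordLinkage similarity measures into one weighted score.
name_score <- function(a, b, w = c(jw = 0.5, lev = 0.3, sndx = 0.2)) {
  jw   <- jarowinkler(a, b)                     # Jaro-Winkler similarity in [0, 1]
  lev  <- levenshteinSim(a, b)                  # normalised Levenshtein similarity
  sndx <- as.numeric(soundex(a) == soundex(b))  # 1 if the soundex codes agree
  unname(w["jw"] * jw + w["lev"] * lev + w["sndx"] * sndx)
}

name_score("some company ltd", "some compny limited")
```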

If I have two lists of names that I want to match ("record link"), then I typically convert both to lower case and remove all punctuation. To take care of "Limited" versus "LTD" I typically create another vector of the first word from each list, which allows extra weighting on the first word. If I think that one list may contain acronyms (maybe ATT or IBM) then I'll acronym-ize the other list. For each list I end up with a data frame of strings that I would like to compare, which I write as separate tables in a MySQL database.
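A sketch of how those extra columns can be built (the names here are invented):

```r
# Hypothetical vector of names that have already been lower-cased and had
# punctuation stripped.
names_clean <- c("international business machines", "some company ltd")

# First word of each name, so the scoring step can put extra weight on it.
first_word <- sapply(strsplit(names_clean, " "), `[`, 1)

# Acronym-ised version of each name, to catch lists that use IBM, ATT, etc.
acronym <- sapply(strsplit(names_clean, " "),
                  function(w) paste(substr(w, 1, 1), collapse = ""))

data.frame(name = names_clean, first_word, acronym, stringsAsFactors = FALSE)
```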

So that I don't end up with too many candidates, I LEFT OUTER JOIN these two tables on something that has to match between the two lists (maybe the first three letters of each name, or the first three letters of the name plus the first three letters of the acronym). Then I calculate match scores using the above functions.
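The same blocking-then-scoring idea can be sketched in R with merge() instead of the MySQL join (the table and column names are invented, and name_score() is the weighted comparator sketched above):

```r
# Hypothetical cleaned tables from the two lists, each with a blocking key
# (here the first three letters of the name).
left  <- data.frame(name_a = c("some company ltd", "acme corp"),
                    key    = c("som", "acm"), stringsAsFactors = FALSE)
right <- data.frame(name_b = c("some compny limited", "zeta industries"),
                    key    = c("som", "zet"), stringsAsFactors = FALSE)

# Left outer join on the blocking key, so only names sharing the key become
# candidate pairs (names with no candidate keep an NA on the right).
candidates <- merge(left, right, by = "key", all.x = TRUE)

# Score only the pairs that actually matched, then sort best-first.
matched <- !is.na(candidates$name_b)
candidates$score <- NA_real_
candidates$score[matched] <- name_score(candidates$name_a[matched],
                                        candidates$name_b[matched])
candidates[order(-candidates$score), ]
```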

You still have to do a lot of manual inspection, but you can sort on the score to quickly rule out non-matches.

Richard Herron
  • +1 for explaining how to normalize text. All too often an overlooked "first step": tolower(), gsub(). I do something very similar by looking at summary(as.factor(my_vector)) and seeing what isn't matching up. Sometimes it's really very simple, and writing out the lines can be a lot cleaner than trying to be fancy with regex. – Brandon Bertelsen Jul 14 '11 at 06:30
  • @AndrewMedico yes, it looks like the package is no longer active on CRAN. You can get past versions from the archive. Am I obligated to become the package maintainer? – Richard Herron Jul 13 '14 at 23:42
  • @RichardHerron thanks for pointing to the packages; it is on CRAN now. Do you have a clue what to do if I have numerical columns only and they differ by random fluctuations and I want to find matches in the numerical data? – Richi W Mar 21 '17 at 15:34
  • @Richard the package doesn't seem to be on CRAN anymore. Can you please look into it? Thanks! – questionmark Nov 08 '20 at 17:49
  • @questionmark which Richard do you mean (Richard or RichardHerron)? And which package? This one: https://cran.r-project.org/web/packages/RecordLinkage/index.html is on CRAN as I see it. – Richi W Nov 12 '20 at 07:16

Maybe Google Refine could help. It looks like a better fit if you have lots of exceptions and you don't know them all yet.

Etienne Racine
  • I have not used Google Refine, but I was impressed by the videos. It seems to be good at handling fuzzy matching and, best of all, it appears that one can save the underlying code so that one can run it again if needed, or run the same algorithm on similar data. – Farrel Jul 14 '11 at 19:56

What you're doing is called [record linkage](https://en.wikipedia.org/wiki/Record_linkage), and it's been a huge field of research for many decades already. Luckily for you, there's a whole bunch of tools out there that are ready-made for this sort of thing. Basically, you can point them at your database, set up some cleaning steps and comparators (like Levenshtein or Jaro-Winkler or ...), and they'll go off and do the job for you.

These tools generally have features in place to solve the performance issues, so that even though Levenshtein is slow they can run fast because most record pairs never get compared at all.
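A back-of-the-envelope illustration in R of why that matters (the counts are made up, just to show the scale):

```r
n <- 100000              # names to deduplicate
choose(n, 2)             # ~5 billion pairs if every pair is compared

# With blocking on, say, the first three letters, only pairs inside each block
# are compared. Assuming roughly 1,000 blocks of about 100 names each:
1000 * choose(100, 2)    # ~5 million pairs, roughly a thousandfold reduction
```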

The Wikipedia link above has links to a number of record linkage tools you can use. I've personally written one called Duke in Java, which I've used successfully for exactly this. If you want something big and expensive you can buy a Master Data Management tool.

larsga

In your case something like an edit-distance calculation would probably work, but if you need to find near-duplicates in larger text-based documents, you can try http://www.softcorporation.com/products/neardup/

vadim