I'm attempting to clean up a database that, over the years, has acquired many duplicate records with slightly different names. For example, in the companies table there are names like "Some Company Limited" and "SOME COMPANY LTD!".
My plan was to export the offending tables into R, convert names to lower case, replace common synonyms (like "limited" -> "ltd"), strip out non-alphabetic characters, and then use agrep to see what looks similar.
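For concreteness, this is roughly the pre-processing I have in mind (a minimal sketch; the companies data frame and the single synonym rule are just placeholders for the real exported table and synonym list):

```r
# Hypothetical example data; in practice this comes from the exported table
companies <- data.frame(
  name = c("Some Company Limited", "SOME COMPANY LTD!"),
  stringsAsFactors = FALSE
)

clean_name <- function(x) {
  x <- tolower(x)                 # case-fold
  x <- gsub("limited", "ltd", x)  # example synonym replacement
  x <- gsub("[^a-z ]", "", x)     # strip non-alphabetic characters
  gsub(" +", " ", trimws(x))      # collapse runs of whitespace
}

companies$clean <- clean_name(companies$name)
# Both example rows now normalise to "some company ltd"
```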
My first problem is that agrep only accepts a single pattern to match, so looping over every company name to match it against the others is slow. (Some of the tables to be cleaned will have tens, possibly hundreds, of thousands of names to check.)
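To make the bottleneck concrete, this is roughly the quadratic loop I'm trying to avoid (a minimal sketch, assuming the cleaned names from above are in nms):

```r
nms <- unique(companies$clean)

# One agrep() call per name: the pattern argument is a single string,
# so a table of n names needs n full passes over all the other names
matches <- lapply(seq_along(nms), function(i) {
  agrep(nms[i], nms[-i], max.distance = 0.2, value = TRUE)
})
```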
I've very briefly looked at the tm package (JSS article), and it seems very powerful but geared towards analysing big chunks of text rather than just names.
I have a few related questions:
1. Is the tm package appropriate for this sort of task?
2. Is there a faster alternative to agrep? (That function uses the Levenshtein edit distance, which is anecdotally slow.)
3. Are there other suitable tools in R, apart from agrep and tm?
4. Should I even be doing this in R, or should this sort of thing be done directly in the database? (It's an Access database, so I'd rather avoid touching it if possible.)