
I am investigating the best way to analyse some very messy employer name data (a free-text field in an application). Each employer might have hundreds of different variations in spelling, abbreviations and even language.

What is the best and most accurate way to solve this issue, and which tool will work best? I am currently using Base SAS, but I don't think it has the capability to do what I require.

HermannHH
  • In Base SAS there is a function called SPEDIS, which "determines the likelihood of two words matching, expressed as the asymmetric spelling distance between the two words". Maybe it could be used to clean up the data? There are a couple of other text-comparison functions you might want to look at: COMPLEV, which "returns the Levenshtein edit distance between two strings", and COMPGED, which "returns the generalized edit distance between two strings". – Stig Eide Apr 30 '15 at 13:27
  • Also `soundex()` - http://stackoverflow.com/a/29373722/214994 – Robert Penridge May 01 '15 at 15:14
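
For readers without SAS at hand, below is a rough pure-Python sketch of the classic American Soundex encoding that the soundex() comment above points to. It follows the textbook algorithm (first letter plus three digits) and is not guaranteed to match SAS's SOUNDEX output in every edge case.

def soundex(name):
    """Sketch of American Soundex: first letter plus three digits.
    Names that sound alike ('Robert', 'Rupert') share a code."""
    if not name:
        return ''
    mapping = {}
    for letters, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in letters:
            mapping[ch] = digit
    name = name.lower()
    digits = []
    prev = mapping.get(name[0], '')
    for ch in name[1:]:
        code = mapping.get(ch, '')
        if code and code != prev:
            digits.append(code)
        if ch not in 'hw':  # 'h' and 'w' do not separate duplicate codes
            prev = code
    return (name[0].upper() + ''.join(digits) + '000')[:4]

print(soundex('Robert'), soundex('Rupert'))  # both print R163

Blocking on the Soundex code first (comparing only names that share a code) can also cut down the number of pairwise edit-distance comparisons dramatically.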

1 Answer


I'm not aware of any tool that can magically solve this problem for you. As already suggested, I'd try to compare each employer name in your list to every other one and check edit distances. The python-Levenshtein package provides a really easy-to-use function for this:

import Levenshtein  # from the python-Levenshtein package

# Number of single-character insertions, deletions and substitutions
# needed to turn one string into the other
ed = Levenshtein.distance('someString', 'someOtherString')

You may want to normalize this in some way, for example by the length of the longer string:

# Scales the distance to [0, 1]: 0.0 means identical, 1.0 completely
# different (wrap the denominator in float() on Python 2)
normalized = ed / max(len('someString'), len('someOtherString'))
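
To make the cleanup step concrete, here is one possible sketch of the pairwise comparison. The 0.25 threshold and the greedy grouping strategy are illustrative assumptions, not a prescription; you would tune both against your own data.

import Levenshtein

def normalized_distance(a, b):
    # Levenshtein distance scaled to [0, 1] by the longer string
    return Levenshtein.distance(a, b) / float(max(len(a), len(b)))

def group_names(names, threshold=0.25):
    # Greedily assign each name to the first group whose
    # representative is within the distance threshold
    groups = []  # list of (representative, members) pairs
    for name in names:
        for rep, members in groups:
            if normalized_distance(name.lower(), rep.lower()) <= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return groups

names = ['Acme Inc.', 'Acme Inc', 'Acme, Inc.', 'ACME INC', 'Widget Ltd.']
for rep, members in group_names(names):
    print(rep, '->', members)

Note that plain edit distance will not match 'Inc.' against 'Incorporated', which is exactly why the abbreviation unification described next helps.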

Then go through the list and clean up based on this. Doing some kind of word or n-gram frequency analysis can also quickly lead you to frequent abbreviations that you may want to unify ('inc.' and 'incorporated', 'ltd.' and 'limited', etc.); a sketch of that follows below.
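
As one way to carry this out (the sample names and the replacement table here are made up for illustration), count token frequencies first, then fold the abbreviations you discover into a canonical form before computing distances:

from collections import Counter

# Hypothetical sample; in practice this is your raw employer column
names = ['Acme Inc.', 'ACME Incorporated', 'Widget Ltd.', 'Widget Limited']

# Word-frequency pass: frequent tokens reveal common abbreviations
tokens = Counter(word.strip('.,').lower()
                 for name in names
                 for word in name.split())
print(tokens.most_common(10))

# Unify the abbreviation pairs you identified before comparing;
# this mapping is illustrative only
replacements = {'incorporated': 'inc', 'limited': 'ltd'}

def normalize_name(name):
    words = [word.strip('.,').lower() for word in name.split()]
    return ' '.join(replacements.get(word, word) for word in words)

print(normalize_name('ACME Incorporated'))  # -> 'acme inc'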

Igor