I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.
Parsing
Let's use this odd name as an example.
name = "THE | big,- Pharma: LLC" # example of a company name
We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:
from cleanco import cleanco
name = cleanco(name).clean_name() # 'THE | big,- Pharma'
Remove all punctuation:
name = name.translate(None, string.punctuation) # 'THE big Pharma'
(for unicode strings, the following code works instead (source, regex):
import regex
name = regex.sub(ur"[[:punct:]]+", "", name) # u'THE big Pharma'
Split the name into tokens using NLTK:
import nltk
tokens = nltk.word_tokenize(name) # ['THE', 'big', 'Pharma']
Lowercase all tokens:
tokens = [t.lower() for t in tokens] # ['the', 'big', 'pharma']
Remove stop words. Note that it might cause problems with companies like On Mars
will be incorrectly matched to Mars
, because On
is a stopword.
from nltk.corpus import stopwords
tokens = [t for t in tokens if t not in stopwords.words('english')] # ['big', 'pharma']
I don't cover accented and special characters here (improvements welcome).
Matching
Now, when we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenstein for this task, but is still not good enough. The reason is that it does not take into account the importance of words in the name (like TF-IDF does). So common words like "Company" influence the score just as much as words that might uniquely identify company name.
To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency(t)**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a.split())
b_tokens = set(b.split())
a_uniq = sequence_uniqueness(a_tokens)
b_uniq = sequence_uniqueness(b_tokens)
return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
Using that, you can match names with similarity exceeding certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard and Jaro-Winkler) and train a binary classification model using some labeled data, which will, given a number of scores, output if the candidate pair is a match or not. More on this can be found in the same blog post.