1

I have an excel file that contains 1000+ company names in one column and about 20,000 company names in another column.

The goal is to match as many names as possible. The problem is that the names in column one (1000+) are poorly formatted, meaning that "Company Name" string can look something like "9Com(panynAm9e00". I'm trying to figure out the best way to solve this. (only 12 names match exactly)

After trying different methods, I've ended up with attempting to match 4-5 or more characters in each name, depending on the length of each string, using regex. But I'm just struggling to find the most efficient way to do this.

For instance:

Column 1

 1. 9Com(panynAm9e00 
 2. NikE4 
 3. Mitrosof2

Column 2

 1. Microsoft
 2. Company Name
 3. Nike

Take first element in Column 1 and look for a match in Column 2. If no exact match, then look for a string with 4-5 same characters.

Any suggestions?

rahlf23
  • 8,869
  • 4
  • 24
  • 54
0-cool
  • 11
  • 1

2 Answers2

6

I would suggest reading your Excel file with pandas and pd.read_excel(), and then using fuzzywuzzy to perform your matching, for example:

import pandas as pd
from fuzzywuzzy import process, fuzz

df = pd.DataFrame([['9Com(panynAm9e00'],
        ['NikE4'],
        ['Mitrosof2']],
        columns=['Name'])

known_list = ['Microsoft','Company Name','Nike']

def find_match(x):

  match = process.extractOne(x, known_list, scorer=fuzz.partial_token_sort_ratio)[0]
  return match

df['match found'] = [find_match(row) for row in df['Name']]

Yields:

               Name   match found
0  9Com(panynAm9e00  Company Name
1             NikE4          Nike
2         Mitrosof2     Microsoft
rahlf23
  • 8,869
  • 4
  • 24
  • 54
  • this is basically the approach I've used before; depending on how much you know about the names involved you can do better than a vanilla "edit distance", for example the bioinformatics world has things like the Needleman–Wunsch algorithm that lets you say that certain changes matter more/less – Sam Mason Oct 03 '18 at 18:36
  • @rahlf23 great, thanks a lot! never used fuzzywyzzy lib so it's certainly good to know. I took a glance at the documentation and my guess is that I can also measure the distance as a score to evaluate how closely the strings match, correct? Cause when I was trying to run a test and used a pair of really lengthy strings that would only share a one common character, those would still appear as a match. Guessing looking at the distance (or score on a scale 0-100) can help eliminate poor matches? Once again, appreciate it. – 0-cool Oct 03 '18 at 19:33
  • Correct, you can change your `scorer`, see here: https://github.com/seatgeek/fuzzywuzzy/issues/137 And you can actually see the score each match receives if you change `extractOne` to `extract` and return the tuple instead of just the first index `[0]` like I have in my answer – rahlf23 Oct 03 '18 at 20:19
  • @rahlf23 wanted to ask for your help one more time here, given that fuzzywuzzy documentation is kinda unclear. Is there a way to set some sort of restrictions on how many characters should match in a pair to have the score of 100. The issue is that some 3 character strings, for instance, would match with incorrect names in case those 3 characters are in sequential order in those more lengthy strings (e.g. "Sam" would match with "Samsung"). Thanks a lot in advance. – 0-cool Oct 08 '18 at 15:00
  • That's a good question. You may have better luck in that case using `partial_token_set_ratio` as your `scorer`. Here is another helpful SO answer: https://stackoverflow.com/a/31823872/8146556 You may also think about limiting your options (`known_list` in my answer above) prior to performing the matching. – rahlf23 Oct 08 '18 at 15:04
0

I imagine numbers are not very common in actual company names, so an initial filter step will help immensely going forward, but here is one implementation that should work relatively well even without this. A bag-of-letters (bag-of-words) approach, if you will:

  1. convert everything (col 1 and 2) to lowercase
  2. For each known company in column 2, store each unique letter, and how many times it appears (count) in a dictionary
  3. Do the same (step 2) for each entry in column 1
  4. For each entry in col 1, find the closest bag-of-letters (dictionary from step 2) from the list of real company names

The dictionary-distance implementation is up to you.

Ravi Patel
  • 346
  • 2
  • 8