What's a better way to clean data before a merge?

Question

I have two different data frames that I need to merge and the merge column ('title') needs to be cleaned up before the merge can happen. Sample data example looks like the following;

data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})

data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})

As expected the merge will not succeed on the first title. I have been using the replace() method but it get's unmanageable very quickly because I have 100s of titles to correct due to spellings, case sensitivity, etc.

Any other suggestions regarding how to best cleanup and merge the data?

Full Example:

import pandas as pd
import numpy as np

data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9'], 'title': ['a.b. company','company_b']})

data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']}) 

data2['title'].replace(regex=True,inplace=True,to_replace=r"\s\(.*\)",value=r'')

replacements = {
   'title': {
        r'a.b. company *.*': 'ab company'
    }
}
data1.replace(replacements, regex=True, inplace=True)

pd.merge(data1, data2, on='title')

I can't think of a better way than this ... what are you imagining? — maxymoo, Nov 07 '16 at 22:29
You could use the [`fuzzywuzzy`](https://pypi.python.org/pypi/fuzzywuzzy) package, and use the function `ratio` for instance. — IanS, Nov 08 '16 at 09:27

score 4 · Accepted Answer · edited May 23 '17 at 12:19

First things first, there is no perfect solution for this problem, but I suggest doing two things:

Do any easy cleaning you can do before hand, including removing any characters you don't expect.
Apply some fuzzy matching logic

You'll see this isn't perfect since even this example doesn't work 100% percent.

First, let's start by making your example a tiny bit more complicated, introducing a regular typo (coampany_b instead of company_b, something that won't get picked up by the easy cleaning below)

data1 = pd.DataFrame({'id': ['a12bcde0','b20bcde9', 'csdfsjkbku'], 'title': ['a.b. company','company_b', 'coampany_b']})
data2 = pd.DataFrame({'serial_number': ['01a2b345','10ab2030','40ab4060'],'title':['ab company','company_b (123)','company_f']})

Then let's assume you only expect [a-z] characters as @Maarten Fabré mentioned. So let's lowercase everything and remove anything else.

data1['cleaned_title'] = data1['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')
data2['cleaned_title'] = data2['title'].str.lower().replace(regex=True,inplace=False,to_replace=r"[^a-z]", value=r'')

Now, let's use difflib's get_close_matches (read more and other options here)

import difflib
data1['closestmatch'] = data1.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data2.cleaned_title)[0])
data2['closestmatch'] = data1.cleaned_title.apply(lambda x: difflib.get_close_matches(x, data2.cleaned_title)[0])

Here is the resulting data1, looking good!

    id          title           cleaned_title   closestmatch
0   a12bcde0    a.b. company    abcompany       abcompany
1   b20bcde9    company_b       companyb        companyb
2   csdfsjkbku  coampany_b      coampanyb       companyb

Now, here is data2, looking a bit less good... We asked it to find the closest match, so it found one for company_f, while you clearly don't want it.

    serial_number   title           cleaned_title   closestmatch
0   01a2b345        ab company      abcompany       abcompany
1   10ab2030        company_b (123) companyb        companyb
2   40ab4060        company_f       companyf        companyb

The ideal case scenario is if you have a clean list of company titles on the side, in which case you should find the closest match based on that. If you don't, you'll have to get creative or manually clean up the hit and miss.

To wrap this up, you can now perform a regular merge on 'closestmatch'.

score 0 · Answer 2 · answered Nov 07 '16 at 22:35

0

You could try to make an simplified_name column in each of the 2 dataframes by setting all characters to lowercase and removing all the non [a-z ] characters and join on this column if this doesn't lead to collisions

answered Nov 07 '16 at 22:35

Maarten Fabré

6,938
1
17
36

What's a better way to clean data before a merge?

2 Answers2