Match and compare strings in Python Dataframe

Question

I have 2 data frames. One pulls values using an API of current values of certain fields on system. The other has the actual current values of those fields. Example: Names on the system and names on paper. I've merged the two on the common column but now am trying to compare the names on Python to see if they are an approximate match and/or if they need an update. Is there a way I can do this? I believe this can be done on excel using isnumber(search(...)).

Case Insensitive and possibly consider abbreviations (I can make a dictionary?) for comparing text strings

Example of how my dataframe looks and the result I want:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-0lax"></th>
    <th class="tg-0lax">Name on System</th>
    <th class="tg-0lax">Current Name</th>
    <th class="tg-0lax">Match</th>
  </tr>
  <tr>
    <td class="tg-0lax">1</td>
    <td class="tg-0lax">APPLE INFORMATION TECHNOLOGY</td>
    <td class="tg-0lax">Apple International Information Technology </td>
    <td class="tg-0lax">No</td>
  </tr>
  <tr>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax">IBM Intl group</td>
    <td class="tg-0lax">IBM International Group</td>
    <td class="tg-0lax">YES</td>
  </tr>
</table>

PS. Apologies in advance if I have broken any rules or etiquette of the Stack community, I'm new to this and open to learning and constructive criticism.

What are your criteria of comparison? They are not exact strings still one is Yes other is No — mad_, Oct 04 '18 at 14:13
@Chris I'm comparing columns, so IBM Intl group vs IBM International Group — eagerlearner96, Oct 04 '18 at 15:01
@mad_ approximate matches, so any matches above a certain predetermined threshold of a ratio would quantify a match? — eagerlearner96, Oct 04 '18 at 15:03
Possible duplicate of [Find the similarity metric between two strings](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings) — mad_, Oct 04 '18 at 15:07

score 1 · Accepted Answer · answered Oct 04 '18 at 14:27

1

Maybe a good way would be calculating similarities, and return the highest match probability?

First of all you need to do some data cleaning like removing special characters, convert all strings to lowercase then search using similarity

st1 = 'apple information technology'
st2 = 'apple international information technology'

from difflib import SequenceMatcher
SequenceMatcher(None, st1, st2).ratio()

answered Oct 04 '18 at 14:27

amo3tasem

140
6

1

Thanks! will explore this package – eagerlearner96 Oct 04 '18 at 15:06

score 1 · Answer 2 · answered Oct 04 '18 at 15:26

Well you can learn more about the string similarity difference here Find the similarity metric between two strings

I am just trying to provide you the application in case you want to try different method using pandas and your metrics.

import pandas as pd
from difflib import SequenceMatcher
df=pd.DataFrame({'Name on System':['APPLE INFORMATION TECHNOLOGY','IBM Intl group'],'Current Name':['Apple International Information Technology','IBM International Group']})

Define your metric in function

def similiarity_ratio(row):
    return SequenceMatcher(None, row['Name on System'].lower(), row['Current Name'].lower()).ratio() 

df['Match']=df.apply(lambda x:similiarity_ratio(x),axis=1)
print(df)

Output

Current Name                                    Name on System                 Match
0   Apple International Information Technology  APPLE INFORMATION TECHNOLOGY  0.800000
1   IBM International Group                     IBM Intl group                0.756757

Thank you! difflib seems to be the package to explore. let's see if it does what I need it do. Thanks again for your help and guidance. — eagerlearner96, Oct 04 '18 at 16:36

Match and compare strings in Python Dataframe

2 Answers2