0

I have 2 data frames. One pulls values using an API of current values of certain fields on system. The other has the actual current values of those fields. Example: Names on the system and names on paper. I've merged the two on the common column but now am trying to compare the names on Python to see if they are an approximate match and/or if they need an update. Is there a way I can do this? I believe this can be done on excel using isnumber(search(...)).

Case Insensitive and possibly consider abbreviations (I can make a dictionary?) for comparing text strings

Example of how my dataframe looks and the result I want:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{font-family:Arial, sans-serif;font-size:14px;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg th{font-family:Arial, sans-serif;font-size:14px;font-weight:normal;padding:10px 5px;border-style:solid;border-width:1px;overflow:hidden;word-break:normal;border-color:black;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
  <tr>
    <th class="tg-0lax"></th>
    <th class="tg-0lax">Name on System</th>
    <th class="tg-0lax">Current Name</th>
    <th class="tg-0lax">Match</th>
  </tr>
  <tr>
    <td class="tg-0lax">1</td>
    <td class="tg-0lax">APPLE INFORMATION TECHNOLOGY</td>
    <td class="tg-0lax">Apple International Information Technology </td>
    <td class="tg-0lax">No</td>
  </tr>
  <tr>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax">IBM Intl group</td>
    <td class="tg-0lax">IBM International Group</td>
    <td class="tg-0lax">YES</td>
  </tr>
</table>

PS. Apologies in advance if I have broken any rules or etiquette of the Stack community, I'm new to this and open to learning and constructive criticism.

Espoir Murhabazi
  • 5,973
  • 5
  • 42
  • 73

2 Answers2

1

Maybe a good way would be calculating similarities, and return the highest match probability?

First of all you need to do some data cleaning like removing special characters, convert all strings to lowercase then search using similarity

st1 = 'apple information technology'
st2 = 'apple international information technology'

from difflib import SequenceMatcher
SequenceMatcher(None, st1, st2).ratio() 
amo3tasem
  • 140
  • 6
1

Well you can learn more about the string similarity difference here Find the similarity metric between two strings

I am just trying to provide you the application in case you want to try different method using pandas and your metrics.

import pandas as pd
from difflib import SequenceMatcher
df=pd.DataFrame({'Name on System':['APPLE INFORMATION TECHNOLOGY','IBM Intl group'],'Current Name':['Apple International Information Technology','IBM International Group']})

Define your metric in function

def similiarity_ratio(row):
    return SequenceMatcher(None, row['Name on System'].lower(), row['Current Name'].lower()).ratio() 

df['Match']=df.apply(lambda x:similiarity_ratio(x),axis=1)
print(df)

Output

Current Name                                    Name on System                 Match
0   Apple International Information Technology  APPLE INFORMATION TECHNOLOGY  0.800000
1   IBM International Group                     IBM Intl group                0.756757
mad_
  • 8,121
  • 2
  • 25
  • 40
  • Thank you! difflib seems to be the package to explore. let's see if it does what I need it do. Thanks again for your help and guidance. – eagerlearner96 Oct 04 '18 at 16:36