
I tried searching for an answer on SO but didn't find any help.

Here is what I'm trying to do:
I have a dataframe (here is a small example of it):

import pandas as pd

df = pd.DataFrame(
    [[1, 5, 'AADDEEEEIILMNORRTU'],
     [2, 5, 'AACEEEEGMMNNTT'],
     [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
     [4, 5, 'DEEEGINOOPRRSTY'],
     [5, 5, 'AACCDEEHHIIKMNNNNTTW'],
     [6, 5, 'ACEEHHIKMMNSSTUV'],
     [7, 5, 'ACELMNOOPPRRTU'],
     [8, 5, 'BIT'],
     [9, 5, 'APR'],
     [10, 5, 'CDEEEGHILLLNOOST'],
     [11, 5, 'ACCMNO'],
     [12, 5, 'AIK'],
     [13, 5, 'CCHHLLOORSSSTTUZ'],
     [14, 5, 'ANNOSXY'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])

My goal is to find the PartnerIds whose Name is similar to another Name at least up to a certain ratio.
Additionally, I only want to compare PartnerIds that have the same CountryId. The matching PartnerIds should be collected in a list and finally written to a new column of the dataframe.

Here is my try:

from difflib import SequenceMatcher

# lookup: PartnerId -> {'CountryId': ..., 'Name': ...}
itemDict = {item[0]: {'CountryId': item[1], 'Name': item[2]} for item in df.values}

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def calculate_similarity(x, itemDict):
    own_name = x['Name']
    country_id = x['CountryId']
    matching_ids = []
    for k, v in itemDict.items():
        # skip the row itself and partners from other countries
        if k != x['PartnerId'] and v['CountryId'] == country_id:
            if similar(own_name, v['Name']) > 0.7:
                matching_ids.append(k)
    return matching_ids

df['Similar_IDs'] = df.apply(lambda x: calculate_similarity(x, itemDict), axis=1)
print(df)
print(df)

The output is:

    PartnerId  CountryId                          Name Similar_IDs
0           1          5            AADDEEEEIILMNORRTU          []
1           2          5                AACEEEEGMMNNTT          []
2           3          5  AAACCCCEFHIILMNNOPRRRSSTTUUY        [15]
3           4          5               DEEEGINOOPRRSTY        [10]
4           5          5          AACCDEEHHIIKMNNNNTTW          []
5           6          5              ACEEHHIKMMNSSTUV          []
6           7          5                ACELMNOOPPRRTU          []
7           8          5                           BIT          []
8           9          5                           APR          []
9          10          5              CDEEEGHILLLNOOST         [4]
10         11          5                        ACCMNO          []
11         12          5                           AIK          []
12         13          5              CCHHLLOORSSSTTUZ          []
13         14          5                       ANNOSXY          []
14         15          5  AABBCEEEEHIILMNNOPRRRSSTUUVY         [3]

My questions now are:
1.) Is there a more efficient way to compute this? I have about 20,000 rows now and will have a lot more in the near future.
2.) Is it possible to get rid of the itemDict and work directly on the dataframe?
3.) Would another distance measure be a better choice?
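For question 3, one conceivable alternative (an illustrative sketch, not from the original post): the names here look like sorted letter bags, so a multiset Jaccard similarity over character counts captures much of the same signal and is cheap to compute. The `jaccard` helper below is a hypothetical name, not a library function:

```python
from collections import Counter

def jaccard(a: str, b: str) -> float:
    """Multiset Jaccard similarity over characters: |a & b| / |a | b|."""
    ca, cb = Counter(a), Counter(b)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 1.0

jaccard('AADDEEEEIILMNORRTU', 'AACEEEEGMMNNTT')  # 9/23, roughly 0.39
```

Unlike `SequenceMatcher`, this measure ignores character order entirely, which may or may not be desirable depending on what the names represent.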

Thanks a lot for your help!

Mykola Zotko
mgruber

1 Answer


You can use the module difflib. First, you need to build the cartesian product of all name pairs by merging the table with itself on CountryId (a self join):

cols = ['Name', 'CountryId', 'PartnerId']
df = df[cols].merge(df[cols], on='CountryId', how='outer')
df = df.query('PartnerId_x != PartnerId_y')
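As a sanity check on the size of this self-merge (a sketch using a hypothetical stand-in frame, not the original data): n rows sharing one CountryId produce n² ordered pairs, and the query drops the n self-pairs.

```python
import string

import pandas as pd

# hypothetical stand-in: 15 partners in one country, single-letter names
toy = pd.DataFrame({'PartnerId': range(1, 16),
                    'CountryId': 5,
                    'Name': list(string.ascii_uppercase[:15])})

cols = ['Name', 'CountryId', 'PartnerId']
pairs = toy[cols].merge(toy[cols], on='CountryId', how='outer')
pairs = pairs.query('PartnerId_x != PartnerId_y')
len(pairs)  # 15*15 - 15 = 210 ordered pairs
```

This quadratic growth per country is why the approach hits memory limits on ~19k rows, as discussed in the comments below.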

In the next step you can apply the function from this answer row-wise and filter for the matches:

def match(x):
    return SequenceMatcher(None, x['Name_x'], x['Name_y']).ratio()

mask = df.apply(match, axis=1) > 0.7
df.loc[mask, ['PartnerId_x', 'Name_x', 'PartnerId_y']]

Output:

     PartnerId_x                        Name_x  PartnerId_y
44             3  AAACCCCEFHIILMNNOPRRRSSTTUUY           15
54             4               DEEEGINOOPRRSTY           10
138           10              CDEEEGHILLLNOOST            4
212           15  AABBCEEEEHIILMNNOPRRRSSTUUVY            3

If you don't have enough memory, you can try iterating over the rows of the merged data frame instead:

lst = []
for idx, row in df.iterrows():
    if SequenceMatcher(None, row['Name_x'], row['Name_y']).ratio() > 0.7:
        lst.append(row[['PartnerId_x', 'Name_x', 'PartnerId_y']])

pd.concat(lst, axis=1).T
Mykola Zotko
  • Looks good so far! In your description you write "inner join" but then use an "outer" join afterwards. I guess it doesn't matter anyway! – mgruber Jan 17 '20 at 08:31
  • @mgruber I meant self join. Thank you for correction. – Mykola Zotko Jan 17 '20 at 08:39
  • I tried to use the code for my 19,053 rows and ended up getting a ```MemoryError```. – mgruber Jan 17 '20 at 08:42
  • @mgruber It looks like you don't have enough RAM. – Mykola Zotko Jan 17 '20 at 08:44
  • @MykolaZotko the merge alone takes around 1 minute already. The error is thrown during the matching. The length of the merged dataframe is 131,044,638, which is obviously way too much for my 'poor' machine to work on. I thought about whether there is a way to first sort the df by CountryId and Name and then only merge each row with the 5 entries above and below it. The other >19k rows will not be a match anyway. – mgruber Jan 17 '20 at 08:48
  • @mgruber You can try to iterate over the merged data frame (see my answer) – Mykola Zotko Jan 17 '20 at 08:59
  • @mgruber You can also use the function `product` from `itertools` to build the cartesian product. It produces a generator and doesn't need to hold the whole data frame in memory. – Mykola Zotko Jan 17 '20 at 09:02
  • Thanks for your suggestion, I will try it now! Do you have an idea how I can do what I suggested above? Only merge each row with the 5 rows above and below it? – mgruber Jan 17 '20 at 09:06
  • Sorry, I don't have time now. You can ask another question. I see that you work as a Data Scientist. Maybe your boss has a position for me; I am currently looking for a job in Germany. – Mykola Zotko Jan 17 '20 at 09:16
  • https://stackoverflow.com/questions/59784767/pandas-merge-one-dataframe-with-itself-only-partially Thank you for your help! Unfortunately all positions are taken. – mgruber Jan 17 '20 at 09:56
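The generator idea suggested in the comments can be sketched as follows (an illustration, not code from the original thread; `similar_pairs` is a hypothetical name). Using `itertools.combinations` per CountryId group, the pairs are produced lazily, so the full cartesian product never has to be materialised as a merged data frame:

```python
from difflib import SequenceMatcher
from itertools import combinations

import pandas as pd

def similar_pairs(df, threshold=0.7):
    """Lazily yield (PartnerId, PartnerId) pairs whose Names match above
    the threshold, comparing only rows that share a CountryId."""
    for _, group in df.groupby('CountryId'):
        rows = list(group[['PartnerId', 'Name']].itertuples(index=False))
        # combinations yields each unordered pair exactly once and never
        # builds the whole product in memory
        for (id_a, name_a), (id_b, name_b) in combinations(rows, 2):
            if SequenceMatcher(None, name_a, name_b).ratio() > threshold:
                yield id_a, id_b

df = pd.DataFrame(
    [[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'],
     [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'],
     [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'],
     [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'],
     [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'],
     [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'],
     [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],
    columns=['PartnerId', 'CountryId', 'Name'])

list(similar_pairs(df))  # [(3, 15), (4, 10)]
```

Because each unordered pair is compared once, this also halves the number of `SequenceMatcher` calls compared with the full self-merge, at the cost of producing each match in only one direction.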