Pandas - Merge one dataframe with itself only partially

Question

This is a follow-up to the question: Pandas Similarity Matching

The ultimate goal of the first question was to find a way to similarity-match each row against every other row that has the same CountryId.

Here is the sample dataframe:

import pandas as pd
df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']], columns=['PartnerId', 'CountryId', 'Name'])

The answer in the other thread worked for that question, but I ran into computational problems. My real source contains more than 19,000 rows and will grow even bigger in the future.

The answer suggested merging the dataframe with itself, so that each row is compared with every other row that has the same CountryId:

df = df.merge(df, on='CountryId', how='outer')

Even for the small example of 15 rows above, we end up with 225 merged rows. For the whole dataset I ended up with 131,044,638 rows, which exhausted my RAM. Therefore I need to find a better way to merge the two dataframes.
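
The size of such a self-merge can be estimated before running it: a CountryId group with n rows contributes n * n merged rows. The following is a minimal sketch of that check, assuming a frame shaped like the sample above with no missing CountryId values. The helper name `self_merge_row_count` and the reduced `sample` frame are made up for illustration.

```python
import pandas as pd


def self_merge_row_count(frame: pd.DataFrame, key: str = "CountryId") -> int:
    """Estimate the rows of frame.merge(frame, on=key) without building it.

    Hypothetical helper: each group of n rows pairs with itself n * n times.
    Assumes the key column has no missing values.
    """
    group_sizes = frame.groupby(key).size()
    return int((group_sizes ** 2).sum())


# Reduced stand-in for the sample data: 15 partners, all in country 5.
sample = pd.DataFrame({
    "PartnerId": range(1, 16),
    "CountryId": [5] * 15,
})

estimated = self_merge_row_count(sample)
# On a frame this small the real merge is cheap, so the two can be compared.
actual = len(sample.merge(sample, on="CountryId", how="outer"))
print(estimated, actual)
```

Running the estimate on the full dataset first shows whether the merge has any chance of fitting in memory.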

As I'm doing a similarity check, I was wondering whether it is possible to:

1. Sort the dataframe by CountryId and Name.
2. Merge each row only with the 3 rows before and after it. For example, after sorting, row 1 will only be merged with rows 2, 3 and 4 (as it is the first row), row 2 only with rows 1, 3, 4 and 5, and so on.

This way, similar names will end up next to each other, and names that are further apart will not be similar anyway, so there is no need to check their similarity.

A cross join will have this type of problem. – BENY, Jan 17 '20 at 11:55

score 0 · Answer 1 · answered Jan 17 '20 at 14:06

I found a workaround for my problem: for each row, take the 3 rows before (if they exist) and the 3 rows after.

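# Sort so that similar names within a country end up next to each other.
sorted_df = df.sort_values(by=['CountryId', 'Name']).reset_index(drop=True)

# Window of neighbouring rows: 3 before and 3 after.
# (Named min_shift/max_shift to avoid shadowing the built-ins min and max.)
min_shift = -3
max_shift = 3

# Build one '-'-joined string of neighbouring PartnerIds per row.
# 'A' is a filler for positions that fall outside the dataframe.
new_sorted = None
for s in range(min_shift, max_shift + 1):
    if s == 0:
        continue  # skip the row itself
    shifted = sorted_df['PartnerId'].astype(str).shift(s, fill_value='A').rename('MatchingID')
    new_sorted = shifted if new_sorted is None else new_sorted + '-' + shifted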

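match = sorted_df.merge(new_sorted, left_index=True, right_index=True)

# Expand each row into one row per neighbouring PartnerId, skipping the 'A' fillers.
matching_df = []
for index, row in match.iterrows():
    row_values = row.tolist()
    matching_df += [row_values[0:-1] + [int(w)] for w in row_values[-1].split('-') if w != 'A']

# Turn the list of rows into a dataframe.
matching_df = pd.DataFrame(matching_df, columns=list(sorted_df.columns) + ['MatchingID'])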
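
One possible variation on the loop above, offered as a sketch rather than a tested replacement: shift within each CountryId group, so that the first and last rows of one country are not paired with rows of a neighbouring country, and collect the pairs with `pd.concat` instead of string splitting. The names `neighbour_pairs`, `window` and `MatchingName` are made up for illustration, and the sketch assumes PartnerId is an integer column without missing values.

```python
import pandas as pd


def neighbour_pairs(frame: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    """Pair each row with up to `window` rows before and after it,
    within the same CountryId, after sorting by Name.

    Hypothetical helper; assumes integer PartnerId with no missing values.
    """
    ordered = frame.sort_values(["CountryId", "Name"]).reset_index(drop=True)
    grouped = ordered.groupby("CountryId")
    pieces = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # a row is not compared with itself
        candidate = ordered.assign(
            MatchingID=grouped["PartnerId"].shift(offset),
            MatchingName=grouped["Name"].shift(offset),
        )
        # Shifts that run past the edge of a country group produce NaN.
        pieces.append(candidate.dropna(subset=["MatchingID"]))
    pairs = pd.concat(pieces, ignore_index=True)
    pairs["MatchingID"] = pairs["MatchingID"].astype(int)
    return pairs


# Tiny demo frame with two countries (not the data from the question).
demo = pd.DataFrame(
    [[1, 5, "AADD"], [2, 5, "AACE"], [3, 5, "BIT"], [4, 7, "APR"], [5, 7, "AIK"]],
    columns=["PartnerId", "CountryId", "Name"],
)
pairs = neighbour_pairs(demo, window=1)
print(pairs[["PartnerId", "CountryId", "MatchingID"]])
```

The result has at most 2 * window rows per input row, so it grows linearly with the data instead of quadratically.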

If anyone can come up with a better idea I would be glad to hear it!
