This is a follow up question from the following Question: Pandas Similarity Matching
The ultimate goal of the first question was to find a way to similarity match each row with another if they have the same CountryId
.
Here is the sample dataframe:
df = pd.DataFrame([[1, 5, 'AADDEEEEIILMNORRTU'], [2, 5, 'AACEEEEGMMNNTT'], [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'], [4, 5, 'DEEEGINOOPRRSTY'], [5, 5, 'AACCDEEHHIIKMNNNNTTW'], [6, 5, 'ACEEHHIKMMNSSTUV'], [7, 5, 'ACELMNOOPPRRTU'], [8, 5, 'BIT'], [9, 5, 'APR'], [10, 5, 'CDEEEGHILLLNOOST'], [11, 5, 'ACCMNO'], [12, 5, 'AIK'], [13, 5, 'CCHHLLOORSSSTTUZ'], [14, 5, 'ANNOSXY'], [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY']],columns=['PartnerId','CountryId','Name'])
The answer in other thread was good for the question but I ended up getting computational problems. My real source contains >19.000 rows and will be even bigger in the future.
The answer suggested to merge
the dataframe with each self to compare it with every other row that has the same CountryId
:
df = df.merge(df, on='CountryId', how='outer')
Even for the small example of 15 rows provided above we will end up with 225 merged rows. For the whole dataset I ended up with 131.044.638 rows which made my RAM refuse to work. Therefore I need to think of a better way to merge
the two dataframes.
As I´m doing a similarity check I was wondering if there is a possibility to:
Sort the dataframe based on the
CountryId
and theName
Only merge each row with the +/- 3 rows connecting. E.g. After sorting Row 1 will only be merged with (2,3 & 4) as this is the first row., Row 2 will only be merged with (1, 3, 4, 5) and so on.
Like this I will have similar names almost next to each other and names "further away" will not be similar anyway. Therefore its not needed to check the similarity of them.