I searched SO for an answer but didn't find anything that helped.
Here is what I'm trying to do:
I have a dataframe (here is a small example of it):
import pandas as pd

df = pd.DataFrame([
    [1, 5, 'AADDEEEEIILMNORRTU'],
    [2, 5, 'AACEEEEGMMNNTT'],
    [3, 5, 'AAACCCCEFHIILMNNOPRRRSSTTUUY'],
    [4, 5, 'DEEEGINOOPRRSTY'],
    [5, 5, 'AACCDEEHHIIKMNNNNTTW'],
    [6, 5, 'ACEEHHIKMMNSSTUV'],
    [7, 5, 'ACELMNOOPPRRTU'],
    [8, 5, 'BIT'],
    [9, 5, 'APR'],
    [10, 5, 'CDEEEGHILLLNOOST'],
    [11, 5, 'ACCMNO'],
    [12, 5, 'AIK'],
    [13, 5, 'CCHHLLOORSSSTTUZ'],
    [14, 5, 'ANNOSXY'],
    [15, 5, 'AABBCEEEEHIILMNNOPRRRSSTUUVY'],
], columns=['PartnerId', 'CountryId', 'Name'])
My goal is to find the PartnerIds whose Name is similar to at least a certain ratio. Additionally, I only want to compare PartnerIds that have the same CountryId. The matching PartnerIds should be appended to a list and finally written to a new column in the dataframe.
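To make "ratio" concrete: the measure I have in mind is `SequenceMatcher.ratio()` from Python's difflib, which is documented as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. A small sketch using three names from the example data (the variable names are just for illustration):

```python
from difflib import SequenceMatcher

# ratio() = 2 * M / T
#   M = number of matched characters, T = len(a) + len(b)
no_overlap = SequenceMatcher(None, "BIT", "APR").ratio()        # no shared characters, M = 0
one_match = SequenceMatcher(None, "AIK", "BIT").ratio()         # only "I" matches, 2 * 1 / 6
identical = SequenceMatcher(None, "ACCMNO", "ACCMNO").ratio()   # every character matches

print(no_overlap, one_match, identical)
```

So two names only count as a match for me when this value is above the chosen threshold.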
Here is my attempt:
from difflib import SequenceMatcher

# Lookup table: PartnerId -> {'CountryId': ..., 'Name': ...}
itemDict = {item[0]: {'CountryId': item[1], 'Name': item[2]} for item in df.values}

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

def calculate_similarity(x, itemDict):
    own_name = x['Name']
    country_id = x['CountryId']
    matching_ids = []
    for k, v in itemDict.items():
        # Skip the row itself and partners from other countries
        if k != x['PartnerId']:
            if v['CountryId'] == country_id:
                ratio = similar(own_name, v['Name'])
                if ratio > 0.7:
                    matching_ids.append(k)
    return matching_ids

df['Similar_IDs'] = df.apply(lambda x: calculate_similarity(x, itemDict), axis=1)
print(df)
The output is:
    PartnerId  CountryId                          Name  Similar_IDs
0           1          5            AADDEEEEIILMNORRTU           []
1           2          5                AACEEEEGMMNNTT           []
2           3          5  AAACCCCEFHIILMNNOPRRRSSTTUUY         [15]
3           4          5               DEEEGINOOPRRSTY         [10]
4           5          5          AACCDEEHHIIKMNNNNTTW           []
5           6          5              ACEEHHIKMMNSSTUV           []
6           7          5                ACELMNOOPPRRTU           []
7           8          5                           BIT           []
8           9          5                           APR           []
9          10          5              CDEEEGHILLLNOOST          [4]
10         11          5                        ACCMNO           []
11         12          5                           AIK           []
12         13          5              CCHHLLOORSSSTTUZ           []
13         14          5                       ANNOSXY           []
14         15          5  AABBCEEEEHIILMNNOPRRRSSTUUVY          [3]
My questions now are:
1.) Is there a more efficient way to compute this? I have about 20,000 rows now and expect a lot more in the near future.
2.) Is it possible to get rid of the itemDict and work directly from the dataframe?
3.) Would a different distance measure be a better choice?
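For context on question 1, here is a rough count of how many times my loop calls similar(). It assumes the worst case where all rows share one CountryId (as in the example data); n_rows and the other names are only for illustration:

```python
# Assumption: every row has the same CountryId, so no pair is filtered out.
n_rows = 20_000

# Each row is compared against every other row exactly once per apply() call.
comparisons = n_rows * (n_rows - 1)

# If each pair were only scored once (assuming the measure is treated as
# symmetric, which I have not verified for SequenceMatcher), the count halves.
unordered_pairs = comparisons // 2

print(comparisons, unordered_pairs)
```

That is roughly 400 million ratio calls at the current size, and the count grows quadratically with the number of rows.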
Thanks a lot for your help!