I am using Python 3 and trying to speed up a loop over a DataFrame. My first DataFrame, data_1_8_Projets_tmp, has about 2,000 string values in the column Key_ID. My second DataFrame, Base_Siren_Data, has about 5,000,000 string values in the column Key_ID.
My problem:
I am looking for the fastest way to compare the 2,000 values from the data_1_8_Projets_tmp DataFrame with the 5 million values from Base_Siren_Data, in order to keep, for each of the 2,000 values, the one that matches it best.
I am using the Fuzzy library to check the similarity between two strings, with a threshold of about 80% (score_cutoff=0.85 in the code below).
Here is my code. It is far too slow: it took many hours to finish running:
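    import time
    # `match` is the fuzzy-matching module used below (its import is not shown here)

    start = time.time()

    def GetSimilarSiret(dataValue, df_Siren_Value):
        # Candidates whose Jaro-Winkler score reaches the 0.85 cutoff
        result = match.extract(dataValue, df_Siren_Value, match_type='jaro_winkler', score_cutoff=0.85)
        # Nothing above the cutoff: return the 'nan' placeholder string
        if result is None or len(result) == 0:
            return 'nan'
        # Keep the first match returned
        return result[0][0]

    # Each of the ~2,000 rows is compared against the full 5,000,000-value column
    data_1_8_Projets_tmp['MachedValue'] = data_1_8_Projets_tmp.apply(
        lambda x: GetSimilarSiret(x["Key_ID"], Base_Siren_Data["Key_ID"]), axis=1)

    print(time.time() - start, ' seconds')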
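To make the expected behavior concrete, here is a tiny toy version of the problem. It is only a sketch: the key values are invented, the plain lists stand in for the two Key_ID columns, and `difflib` from the standard library replaces the fuzzy-matching library so the snippet runs on its own. Its similarity ratio is not Jaro-Winkler, so the scores will differ from those in my real code.

```python
import difflib

# Invented sample keys standing in for the two Key_ID columns
projects_keys = ["ACME SARL PARIS", "DUPONT ET FILS LYON", "UNKNOWN COMPANY"]
siren_keys = ["ACME SARL PARIS 75001", "DUPOND ET FILS LYON", "BOULANGERIE MARTIN"]

def get_similar_key(value, candidates, cutoff=0.8):
    """Return the closest candidate at or above the cutoff, or 'nan' if none qualifies."""
    # difflib's ratio is a stand-in for the Jaro-Winkler score used in the real code
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else "nan"

# One lookup per project key, each scanning the whole candidate list
matched = [get_similar_key(key, siren_keys) for key in projects_keys]
print(matched)
```

The first two keys should map to their near-duplicates and the third to 'nan'. This is the result I want for each of my 2,000 values, only computed much faster against the 5 million candidates.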
Thanks in advance