0

I am using python 3. I am trying to loop over dataframe faster. My First dataframe data_1_8_Projets_tmp has about 2000 string values in the column Key_ID, my second dataframe Base_Siren_Data has about 5 000 000 string values in the column Key_ID.

My Problem :

I'am looking for the fastest way to compare my 2000 values from data_1_8_Projets_tmp dataframe with my 5 Million in order the take the value that match the most.

I am using the library Fuzzy to check the similarity between 2 string value with a threshold of 80%

Here is my code, it is not good because it took really many hours to finish running :

start = time.time()
def GetSimilarSiret(dataValue,df_Siren_Value): 
    result  = match.extract(dataValue, df_Siren_Value, match_type='jaro_winkler', score_cutoff=0.85)
    if result == None :
        return 'nan'
    
    elif(len(result)>0):
        return result[0][0]
        
    else:
        return 'nan'
        
        
    
    

data_1_8_Projets_tmp['MachedValue'] = data_1_8_Projets_tmp.apply(lambda x: GetSimilarSiret(x["Key_ID"],Base_Siren_Data["Key_ID"]),axis=1)

print(time.time() - start, ' seconds')

Thanks in advance

hk77
  • 59
  • 8
  • Hi and welcome on SO. It will be great if you can have a look at [ask] and then try to produce a [mcve]. – rpanai Jul 05 '23 at 12:57
  • can you share your dfs as csv files for testing? – RomanPerekhrest Jul 05 '23 at 13:07
  • 1
    Does this help? [Vectorizing or Speeding up Fuzzywuzzy String Matching on PANDAS Column](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column) Beside a good underlying structure you also need a fast, best vectorization/parallelizable fuzzy implementation. Linked is one that is in C++ that might already help too. – Daraan Jul 05 '23 at 13:13
  • The base is called 'siren', and the function name mention 'siret'. By any chance, are you not referring to the French "système d'identification du répertoire des entreprises" (siren) and "des établissements" (siret), so 9 and 14 digits, unique ID? Because if so, it makes no sense to use fuzzy matching (it would be like trying to find people in national database whose social security number is almost the same as the one of the person you are interested in. But "almost the same" means different. It is a binary thing. It is not like related people had similar ssn) – chrslg Jul 05 '23 at 14:02
  • So just do an exact match on 9 digits (or an exact match on the first 9 digits on the 14 digits for siret, with priority to the one that match the whole 14 digits, since 9 first digits identical, means "same company", and the 5 remaining digits identical means "and same unit") – chrslg Jul 05 '23 at 14:04
  • And if that has nothing to do with french siret/siren, well, you'll have to find another way than a fuzzy match library. You need vectorized matching. Using an external library you are doomed to loop "all vs all" in python, without taking advantage from index, or at least from pandas vectorization. – chrslg Jul 05 '23 at 14:05
  • @chrslg I am matching companies names with another companies data source. there is not match with siret – hk77 Jul 05 '23 at 14:10
  • @Daraan I am looking the article – hk77 Jul 05 '23 at 14:10

0 Answers0