I want to build a clean list of boat makes and models by matching two datasets (a raw "lambda" one and a reference one) with fuzzywuzzy (Levenshtein-based matching in Python), but my code raises an error that I don't understand.

The two datasets:

https://www.transfernow.net/dl/202203070QxpVjYJ

Here is my code:

    #%%
    from fuzzywuzzy import process
    import pandas as pd
    
    #%%
    BASE_LAMBDA_PATH = '../ressources/marques_modeles_lambda_entier.csv'
    BASE_REF_PATH = '../ressources/marques_modeles_ref_entier.csv'
    #%%
    lambda_df = pd.read_csv(BASE_LAMBDA_PATH, sep=";")
    #%%
    ref_df = pd.read_csv(BASE_REF_PATH, sep=";")
    
    #%% create the result table (initially empty)
    df_result = pd.DataFrame(columns=['marque', 'lambda','ref','score'])
    
    #%% loop over the table of lambda models
    for ind in lambda_df.index:
        marque = lambda_df['MARQUE_REF'][ind]
        modele_lambda = lambda_df['MODELE'][ind]
        ref_list = (ref_df[(ref_df['lib_marque'] == marque)]['lib_model']).to_list()
        choices = process.extract(modele_lambda, ref_list, limit=1)
        approx = choices[0][0]
        score = choices[0][1]
        df2 = pd.DataFrame(data = [(marque, modele_lambda, approx, score)],\
             columns=['marque', 'lambda','ref','score'])
        df_result = pd.concat([df_result, df2], axis=0, ignore_index=True)
    
    df_result.to_csv('output_matching_groupe.csv', sep=';', index=False)
    
    '''
    tdep = time.time()
    tfin = time.time()
    print(f"duree de {tfin-tdep} secondes")
    '''
    # %%

The error:

    IndexError                                Traceback (most recent call last)
    c:\Users\boats\src\list_matching_groupe.py in <cell line: 1>()
          20 ref_list = (ref_df[(ref_df['lib_marque'] == marque)]['lib_model']).to_list()
          21 choices = process.extract(modele_lambda, ref_list, limit=1)
    ----> 22 approx = choices[0][0]
          23 score = choices[0][1]
          24 df2 = pd.DataFrame(data = [(marque, modele_lambda, approx, score)],\
          25      columns=['marque', 'lambda','ref','score'])
    
    IndexError: list index out of range

I don't understand it, because `choices[0][0]` actually works: I get `'Guy Couach 1401'`.
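
One reading that fits the traceback: an `IndexError` on `choices[0]` means `process.extract` returned an empty list for that row, which happens when `ref_list` is empty, i.e. when no row of `ref_df` has exactly the same `lib_marque` as the current `marque`. Rows that do have candidates work fine, which would explain why `choices[0][0]` gives a result when tried on one of them. The sketch below guards against that case. The helper name `best_match` and its `extract_fn` parameter are hypothetical, introduced only for illustration; the matcher is passed in as an argument, so `process.extract` can be supplied from the script above.

```python
def best_match(query, candidates, extract_fn):
    """Return (match, score), or (None, 0) when no match can be computed.

    extract_fn is assumed to behave like fuzzywuzzy's process.extract:
    called as extract_fn(query, candidates, limit=1), it returns a list
    of (choice, score) tuples, possibly empty.
    """
    # A missing value (NaN is a float, not a str) cannot be matched.
    if not isinstance(query, str):
        return None, 0
    # No reference model for this brand: nothing to compare against.
    if not candidates:
        return None, 0
    results = extract_fn(query, candidates, limit=1)
    if not results:
        return None, 0
    return results[0][0], results[0][1]

# Possible use inside the loop (assumes `process` is imported from fuzzywuzzy):
# approx, score = best_match(modele_lambda, ref_list, process.extract)
```

Rows that come back as `(None, 0)` can then be inspected separately to see which brands have no counterpart in the reference file.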

YannP
  • Can you check which index of the loop fails? You say `choices[0][0]` works, but maybe you have a row whose format is different, so it fails there. You have empty lines in your source file, which create the error. – Ssayan Mar 07 '22 at 14:19
  • I just did; it looks like it's the line "Camper & Nicholson;ENDEAVOUR 42". I think it's the "&" that causes the error. Do you know how to ignore special characters with a regex in this case? I also deleted the NaN values; they were blocking the process. – YannP Mar 07 '22 at 14:25
  • I don't know much about regex but that should be easy to find, threads like [this one](https://stackoverflow.com/questions/43358857/how-to-remove-special-characters-except-space-from-a-file-in-python) should do the trick. – Ssayan Mar 07 '22 at 14:33
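
For the special-character question in the comments above, a minimal sketch using only the standard `re` module could look like this. The function name `normalize_label` is hypothetical, and the approach assumes the two files differ only in punctuation, case, or spacing; if the brand is actually spelled differently in the reference file, the brand column would need fuzzy matching as well.

```python
import re

def normalize_label(text):
    """Strip punctuation, collapse whitespace, and lowercase a label."""
    # Replace every character that is not a word character or whitespace
    # (for example "&" or "-") with a space. Accented letters are kept.
    no_punct = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", no_punct).strip().lower()

print(normalize_label("Camper & Nicholson"))  # camper nicholson
```

Applying the same function to both `MARQUE_REF` and `lib_marque` (after dropping missing values) before the equality filter would make the comparison insensitive to these differences.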
