1

I'm trying to merge 2 DataFrames using fuzzy-wuzzy. the df1 data frame is a table with sales items (150K rows and 30 cols) and df2 is a table (30K lines and 10 cols) with reference data for sales item name ('Defproduto'). I don't have reference data for the sales items but some of them I know they share the same reference data although the name is slightly different.

Example: I have ref data for sales item aa_XXX_yy_RENT17_zzz_www_class but I don't have for sales item aa_XXX_yy_VER17_zzz_www_class. Using fuzzy-wuzzy I can get the ref data for aa_XXX_yy_RENT17_zzz_www_class and assign it in the sales table for all sales Items with name aa_XXX_yy_VER17_zzz_www_class

I'm using the follwing code:

    import pandas as pd
    df1 = pd.read_excel('/.../dummy_data_examples.xlsx', sheet_name='Sheet1')
    df2 = pd.read_excel('/.../refdata_examples.xlsx', sheet_name='Sheet1')
    df1['price'] = df1.apply(lambda row:df2.loc[process.extract(row.Defproduto,
                             df2['Defproduto'],limit=1,scorer=fuzz.ratio)[0][2],].price,axis=1)

Problem: it is dreadfully slow. It takes about 40sec just for a sample of 100 lines. I have 150K to process (and need to update 5 different characteristics of df1, not just price).

Is there a more efficient/fast process to do this? I've tried different approaches with pandas merge but none able to get the results to need.

Thank you in advance

Sachith Wickramaarachchi
  • 5,546
  • 6
  • 39
  • 68

0 Answers0