I'm trying to merge 2 DataFrames using fuzzy-wuzzy
.
the df1
data frame is a table with sales items (150K rows and 30 cols) and df2
is a table (30K lines and 10 cols) with reference data for sales item name ('Defproduto
').
I don't have reference data for the sales items but some of them I know they share the same reference data although the name is slightly different.
Example: I have ref data for sales item aa_XXX_yy_RENT17_zzz_www_class
but I don't have for sales item aa_XXX_yy_VER17_zzz_www_class
.
Using fuzzy-wuzzy
I can get the ref data for aa_XXX_yy_RENT17_zzz_www_class
and assign it in the sales table for all sales Items with name aa_XXX_yy_VER17_zzz_www_class
I'm using the follwing code:
import pandas as pd
df1 = pd.read_excel('/.../dummy_data_examples.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('/.../refdata_examples.xlsx', sheet_name='Sheet1')
df1['price'] = df1.apply(lambda row:df2.loc[process.extract(row.Defproduto,
df2['Defproduto'],limit=1,scorer=fuzz.ratio)[0][2],].price,axis=1)
Problem: it is dreadfully slow. It takes about 40sec just for a sample of 100 lines. I have 150K to process (and need to update 5 different characteristics of df1, not just price).
Is there a more efficient/fast process to do this? I've tried different approaches with pandas merge but none able to get the results to need.
Thank you in advance