I'm working on a python script to perform the following tasks.
- Creates pandas df and finds matches based on the sku number.
- Compares the sku rows in the 2 df's
- Finds the diff in price and quantity between the 2 sources
- Outputs a new df and CSV with the found diffs
The CSV has more than 70k rows. A subset of the dataset for supplier.csv
sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
3131,120.26,"HP - 3 ml - svart - original - bläckpatron - för QuietJet Plus, ThinkJet",382,0.03,,HP,51604A,,
3135,132.33,"HP - Röd - original - bläckpatron - för QuietJet Plus, ScanJet Enterprise Flow N9120 Flatbed Scanner",3,0.03,,HP,51605R,,
A subset of the dataset for local.csv
sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
259372,445.78,"HP 45 Large - 42 ml - Stor - svart - original - bläckpatron - för Deskjet 12XX, Officejet g55, g85, k60, R40, R60, R80, T45, T65, Officejet Pro 11XX",0,0.15,,,,,
365684,1512.91,"HP 80 - 350 ml - cyan - original - DesignJet - bläckpatron - för DesignJet 1050c, 1050c plus, 1055cm, 1055cm plus",0,0.64,,,,,
Here's the code
def find_new_and_diff_dataframes(local_filename, supplier_filename):
df_local = pd.read_csv(local_filename)
df_supplier = pd.read_csv(supplier_filename)
df_new_output = df_supplier[~df_supplier.sku.isin(df_local['sku'].to_list())]
df_diff_output = pd.DataFrame()
supplier_copy = df_supplier.to_dict("records")
for supplier_row in supplier_copy:
local_index = df_local[df_local['sku'] == supplier_row['sku']].index.values
if len(local_index) == 1:
local_row = df_local.iloc[local_index[0]]
if supplier_row['price'] != local_row['price'] or supplier_row['quantity'] != local_row['quantity']:
df_diff_output = df_diff_output.append(supplier_row, ignore_index=True)
df_diff_output = df_diff_output[["sku", "price", "quantity"]]
df_diff_output['sku'] = df_diff_output['sku'].astype(int)
df_diff_output['quantity'] = df_diff_output['quantity'].astype(int)
return df_new_output, df_diff_output
The problem is that it takes a lot of time to generate the result. Around 220-240 seconds.
Is there any other method which I can use to lower the compute time?