Speedup finding differing and new rows in 2 dataframes

Question

I'm working on a python script to perform the following tasks.

Creates pandas df and finds matches based on the sku number.
Compares the sku rows in the 2 df's
Finds the diff in price and quantity between the 2 sources
Outputs a new df and CSV with the found diffs

The CSV has more than 70k rows. A subset of the dataset for supplier.csv

sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
3131,120.26,"HP - 3 ml - svart - original - bläckpatron - för QuietJet Plus, ThinkJet",382,0.03,,HP,51604A,,
3135,132.33,"HP - Röd - original - bläckpatron - för QuietJet Plus, ScanJet Enterprise Flow N9120 Flatbed Scanner",3,0.03,,HP,51605R,,

A subset of the dataset for local.csv

sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
259372,445.78,"HP 45 Large - 42 ml - Stor - svart - original - bläckpatron - för Deskjet 12XX, Officejet g55, g85, k60, R40, R60, R80, T45, T65, Officejet Pro 11XX",0,0.15,,,,,
365684,1512.91,"HP 80 - 350 ml - cyan - original - DesignJet - bläckpatron - för DesignJet 1050c, 1050c plus, 1055cm, 1055cm plus",0,0.64,,,,,

Here's the code

def find_new_and_diff_dataframes(local_filename, supplier_filename):
    df_local = pd.read_csv(local_filename)
    df_supplier = pd.read_csv(supplier_filename)

    df_new_output = df_supplier[~df_supplier.sku.isin(df_local['sku'].to_list())]

    df_diff_output = pd.DataFrame()
    supplier_copy = df_supplier.to_dict("records")

    for supplier_row in supplier_copy:
        local_index = df_local[df_local['sku'] == supplier_row['sku']].index.values
        if len(local_index) == 1:
            local_row = df_local.iloc[local_index[0]]
            if supplier_row['price'] != local_row['price'] or supplier_row['quantity'] != local_row['quantity']:
                df_diff_output = df_diff_output.append(supplier_row, ignore_index=True)

    df_diff_output = df_diff_output[["sku", "price", "quantity"]]
    df_diff_output['sku'] = df_diff_output['sku'].astype(int)
    df_diff_output['quantity'] = df_diff_output['quantity'].astype(int)

    return df_new_output, df_diff_output

The problem is that it takes a lot of time to generate the result. Around 220-240 seconds.

Is there any other method which I can use to lower the compute time?

Yes, you can merge and then use vectorized math operations to get price and quantity differences. Probably will be 100x faster, if not 1000x. If you provide sample data and your expected output I'm sure someone will provide you with a detailed solution. — ALollz, Jul 26 '21 at 15:42
https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples has good tips for how to create an example. It doesn't need to be all of your data, but just a self-contained small subset and then your expected output from that sample — ALollz, Jul 26 '21 at 15:47
By "product number" do you mean the sku? There is no sku in common between your samples... — Chris Wesseling, Jul 26 '21 at 19:44
@ChrisWesseling yes, product number is sku. I have edited the question accordingly, — Noob, Jul 27 '21 at 06:43

score 1 · Accepted Answer · answered Jul 26 '21 at 20:51

Your example data wasn't a complete minimal example. So I changed it a bit. But this is the core of what your function should do:

sup_df = pd.read_csv(StringIO("""sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
3131,120.26,"HP - 3 ml - svart - original - bläckpatron - för QuietJet Plus, ThinkJet",382,0.03,,HP,51604A,,
61525,295.11,APC ProtectNet - Överspänningsskydd - beige,1,0.2,,APC,PTEL2,,
67698,318.76,"Targus Notepac Clamshell - Notebook-väska - 15.6"" - svart",0,1.23,,TARGUS HARDWARE,CN01,,2021-02-19
81160,258.96,"Epson - Svart - tygfärgband - för DFX 5000, 8000, 8500",14,0.42,,Epson,C13S015055,,
81492,89.64,"Epson - Svart - tygfärgband - för LQ 1060, 2500, 2500+, 2550, 670, 680, 680Pro",15,0.08,,Epson,C13S015262,,
"""))
local_df = pd.read_csv(StringIO("""sku,price,name,quantity,weight,retail_price,vendor_name,vendor_number,vendor_part_number,available_eta
61525,295.11,APC ProtectNet - Överspänningsskydd - beige,1,0.2,,APC,PTEL2,,
67698,320.76,"Targus Notepac Clamshell - Notebook-väska - 15.6"" - svart",0,1.23,,TARGUS HARDWARE,CN01,,2021-02-19
81160,258.96,"Epson - Svart - tygfärgband - för DFX 5000, 8000, 8500",13,0.42,,Epson,C13S015055,,
259372,445.78,"HP 45 Large - 42 ml - Stor - svart - original - bläckpatron - för Deskjet 12XX, Officejet g55, g85, k60, R40, R60, R80, T45, T65, Officejet Pro 11XX",0,0.15,,,,,
1696925,1137.85,Toner/yellow 2500sheet for Phaser 6128,0,0.14,,,,,"""))

# index on the common column
sup_df.set_index('sku', inplace=True)
local_df.set_index('sku', drop=False, inplace=True)

# take natural join
diff_df = sup_df[['price', 'quantity']].join(
    local_df[['price', 'quantity']],
    rsuffix='_local',
    how='inner',
)
# select where price or quantity differ
diff_df = diff_df[(
    (diff_df['price'] != diff_df['price_local'])
    | (diff_df['quantity'] != diff_df['quantity_local'])
)]
# project just the left prices and quantities
diff_df = diff_df[['price', 'quantity']]


new_df = sup_df[(
    # a Series of bool of the left join where right values are missing
    sup_df.join(local_df['sku'])[['sku']].isna()['sku']
)]

diff_df, new_df

This should be faster than your code, as this doesn't loop over all rows, but uses operations over complete vectors.

Skuller · Answer 2 · 2021-07-26T16:33:25.880

0

first get both dataframes to a single dataframe df with headers "supplier_row_price","local_row_price","supplier_row_quantity","local_row_quantity"

def compare_df(df):
    p=[]
    q=[]
    for i in range(len(df)):
        p_diff=abs(df['supplier_row_price'][i]-df['local_row_price'][i])
        q_diff=abs(df['supplier_row_quantity'][i]-df['local_row_quantity'][i])
        p.append(p_diff)
        q.append(q_diff)
    df["price_diff"]=p
    df["quantity_diff"]=q
    return df
df=compare_df(df)

edited Jul 26 '21 at 16:33

answered Jul 26 '21 at 16:09

Skuller

23
4

I have edited the question. can you check? – Noob Jul 26 '21 at 16:12

score 0 · Answer 3 · answered Jul 26 '21 at 17:08

import pandas as pd
sp=pd.read_csv("supplier.csv")
loc=pd.read_csv("local.csv")

#supplier minus local
def compare_df(sp,loc):
    p=[]
    q=[]
    for i in range(len(sp)):
        price=sp["price"][i]-loc["price"][i]
        quantity=sp["quantity"][i]-loc["quantity"][i]
        p.append(price)
        q.append(quantity)
    p_diff=p
    q_diff=q
    data={"price_diff_supply-local":p_diff,"quantity_diff_supply-local":q_diff}
    df=pd.DataFrame(data)
    return df
ff=compare_df(sp,loc)
ff

This works fine. But I'm facing problem in the for loop. You can check the question. I need a substitute for that which will make it faster. — Noob, Jul 26 '21 at 17:18

Speedup finding differing and new rows in 2 dataframes

3 Answers3