substitute for nested for loops in pandas dataframes for big data handling

Question

There are 2 .csv files - file1(with heads Order_ID, Amount, Commission, GST, etc.) and file2(with heads hbx_id, amount, status, etc.).

I have to tell that the “amount” in File2 matches with (“Amount” – “Commission” – “GST”) in File1. This has to be performed basis “Order_ID” in File1 and “hbx_id” in File2. I am required to put the status as “matched” against items that are matching else put “not matched”. For items in File2 which are not present in File1, I am supposed to enter the status “not present” in the status column in File2.

I solved the problem with the use of nested for loops and my code looked like this -

for i in file2.index:
    flag = 0
    for j in file1.index:
        if (file1['Order_ID'][j]).count(file2['hbx_id'][i])>0:
            flag = 1
            if file1['Amount'][j]-file1['Commission'][j]-file1['GST'][j]==file2['amount'][i]:
                file2.at[i,'Status'] = 'matched'
            else:
                file2.at[i,'Status'] = 'not matched'
    if flag == 0:
        file2.at[i,'Status'] = 'not present'

This worked just fine, but my question is that this would not be a feasible option if the datasets are big(coz of the for loop iterations). Is there any substitute for this method, a function from pandas, or any other library or something?

EDIT:- The expected output looks like this -

       hbx_id      tran_date    amount  Status
0   AMBLEY127928'   01-09-2019  36.00   matched
1   SIPTBT18469'    01-09-2019  65.50   not matched
2   'KITJUJ272150'  01-09-2019  37.28   matched
3   'LEGNOE40184'   01-09-2019  20.00   matched
4   hb-pay_9492291-1'   01-09-2019  313.77  not matched
5   'SF3C145362'    01-09-2019  25.00   matched
6   hb-AMBLEY127931-1'  01-09-2019  75.50   not matched
7   'HEANEM26099'   01-09-2019  19.62   matched
8   hb-COOINT242360-1'  01-09-2019  44.15   not present
9   'MUNDCT3880'    01-09-2019  142.00  matched
10  'KDA178133' 01-09-2019  24.53   matched
11  'KSCA1492699'   01-09-2019  9.81    matched
12  'FOROCH372249'  01-09-2019  14.72   matched
13  'CHAMXN202625'  01-09-2019  29.43   matched
14  'DOSFSX48229'   01-09-2019  49.06   matched
15  'ARAEZG242324'  01-09-2019  25.00   matched
16  hb-SHRWOG391499-1'  01-09-2019  19.56   matched
17  'MALSJU13377'   01-09-2019  16.50   matched
18  hb-OMKXDT420930-1'  01-09-2019  10.00   matched
19  'SCCA528997'    01-09-2019  20.00   not matched

file1.head()

       Transaction_ID                          Order_ID Transaction_Date    Updated_Date    Transaction_Type    Status  MID Customer_ID Customer_Nickname   Customer_Phone_No.  ... Commission_Rate Product_Code    GMV_Tier    Transaction_Slab    Request_Type    Refund_Type Refund_Actor    Split_Flag  Split_MID   Split_Id
0   '20190901111212800100168044885072100'   hb-SIPJZO32656-1'   '2019-09-01 00:54:14'   '2019-09-01 00:54:33'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN 9804817853' ... NaN '51051000100000000010'  NaN NaN 'SEAMLESS'  NaN NaN NaN NaN NaN
1   '20190901111212800110168956985465456'   'SHRWOG391493'  '2019-09-01 02:14:43'   '2019-09-01 02:14:46'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
2   '20190901111212800110168835885639302'   'MIDYEY245461'  '2019-09-01 00:23:36'   '2019-09-01 00:23:39'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
3   '20190901111212800110168061185352899'   'KITJUJ272194'  '2019-09-01 03:40:00'   '2019-09-01 03:40:03'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  NaN NaN NaN ... NaN '51051000100000000001'  NaN NaN 'SEAMLESS_3D_FORM'  NaN NaN NaN NaN NaN
4   '20190901111212800110168140285515665'   'EXLRYB3995'    '2019-09-01 00:08:21'   '2019-09-01 00:08:21'   'ACQUIRING' 'SUCCESS'   'EatGoo08603428284199'  '216810000026794940140' NaN '9515053146'    ... NaN '51051000100000000001'  NaN NaN 'AUTO_DEBIT'    NaN NaN NaN NaN NaN

file2.head()

        hbx_id      tran_date   amount
0   AMBLEY127928'   01-09-2019  36.00
1   SIPTBT18469'    01-09-2019  65.50
2   'KITJUJ272150'  01-09-2019  37.28
3   'LEGNOE40184'   01-09-2019  20.00
4   hb-pay_9492291-1'   01-09-2019  313.77

if you convert everything to numpy data types, you could use numba to speed things up. but looks like you can probably vectorise the loop. — matman9, May 14 '20 at 08:28
kindly share some data, with ur expected output. that way u get a targeted answer, or a redirect if sth similar has been answered before. [guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — sammywemmy, May 14 '20 at 08:28
@matman9 can you please guide me through the vectorized approach? I'm facing trouble implementing it... — Abhyudaya Nilosey, May 14 '20 at 12:08
could post the columns that you are using for file1, file1[['Order_ID', 'Amount', 'Commission', 'GST']], is difficult to see the data, thanks — matman9, May 15 '20 at 09:32

score 0 · Answer 1 · answered May 15 '20 at 11:08

For a start I would simplify your data so make a new dataframe for file1 which is just the columns that you are using;

df1 = file1[['Order_ID', 'Amount', 'Commission', 'GST']]

with this you can then do you calculation in your first if statement and make a new column with these values;

df1['amount_check'] = df1['Amount'] - df1['Commission'] - df1['GST']

Assuming that the df1['Order_ID'] are unique values and some of which are in file2['hbx_id'] you could merge the two dataframes on these two columns keeping everything so you would have one dataframe with all your data aligned.

df1 = df1.merge(file1, left_on='Order_ID', right_on='hbx_id') # might need to add more parameters...

You could simply then make a new column df1['Status'] where if the df1['amount_check'] == df1['amount'] then you give the value 'matched' etc

substitute for nested for loops in pandas dataframes for big data handling

1 Answers1