
I have a DataFrame like the one below:

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1500 |   1 |        0 |      1500 |
| Inv_001   |     1720 |   777 |   100 |   1 |      260 |       100 |
| Inv_001   |     1720 |   888 |   120 |   1 |      260 |       120 |
| Inv_002   |     1160 |   360 |   700 |   1 |        0 |       700 |
| Inv_002   |     1160 |   777 |   100 |   1 |      360 |       100 |
| Inv_002   |     1160 |   888 |   120 |   1 |      360 |       120 |
| Inv_002   |     1160 |   999 |   140 |   1 |      360 |       140 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

I want to add each row's ProdTotal to the row whose Item# equals that row's MainCode. Inspired by the answers I got to an earlier question of mine, I managed to produce the desired output below:

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1720 |   1 |        0 |      1720 |
| Inv_002   |     1160 |   360 |  1060 |   1 |        0 |      1060 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

using the code below:

import pandas as pd

df = pd.read_csv('data.csv')

# One sub-DataFrame per invoice.
df_grouped = dict(tuple(df.groupby(['InvoiceNo'])))

remove_index = []  # indices of sub-item rows to drop at the end
ids = 0

for x in df_grouped:
    for index, row in df_grouped[x].iterrows():
        ids += 1
        try:
            # Sub-items of the current row: rows whose MainCode equals this Item#.
            main_code_data = df_grouped[x].loc[df_grouped[x]['MainCode'] == row['Item#']]
            length = len(main_code_data['Item#'])
            iterator = 0
            index_value = 0
            # Accumulate the ProdTotal of the following rows into the current row.
            for i in range(len(df_grouped[x].index)):
                index_value += df_grouped[x].at[index + iterator, 'ProdTotal']
                df.at[index, 'ProdTotal'] = index_value

                iterator += 1

            # Remember the sub-item rows so they can be removed afterwards.
            for item in main_code_data.index:
                remove_index.append(item)

        except:  # missing index labels raise KeyError; errors are silently ignored
            pass

# Keep only the main items.
df = df.drop(remove_index)

But the data consists of millions of rows, and this code runs very slowly. From a brief Google search and comments from other members, I learned that `iterrows()` is what makes the code slow. How can I replace `iterrows()` to make my code more efficient and more Pythonic?

Sunni
  • Why is it not zero? None of the rows have main codes that are equal to the item numbers. Do you mean just adding them up in groups? – ifly6 Aug 23 '19 at 15:54
  • What should be zero? @ifly6 – new Q Open Wid Aug 23 '19 at 15:56
  • The `MainCode` is equal to the `Item#` within each `InvoiceNo`. E.g. in `Inv_001`, `Item# 777 888` have `MainCode 260`, so these items are part of `260`. – Sunni Aug 23 '19 at 15:58
  • @ifly6 I posted a revised example. This is not a simple groupby: items are added into `ProdTotal` only when they have an association to another code, which here is `260`. – Sunni Aug 23 '19 at 16:06
  • But in your example dataframe, none of the `MainCode` values are equal to `Item#`. – m13op22 Aug 23 '19 at 16:16
  • @HS-nebula Please refer to this [question](https://stackoverflow.com/questions/56317856/allocating-the-sub-items-to-the-main-items-in-pandas) of mine; it clarifies things further. – Sunni Aug 23 '19 at 16:18
  • Which column has the smallest number of unique values, `Item#` or `MainCode`? – m13op22 Aug 23 '19 at 16:20
  • @HS-nebula Column `Item#`. Think of this as the main product; the other items (e.g. in `Inv_001` `777, 888` and in `Inv_002` `777 888 999`) are the sub-products associated with it. – Sunni Aug 23 '19 at 16:24
  • @HS-nebula Is it clear? If not, please feel free to ask for more clarification. – Sunni Aug 23 '19 at 16:33
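
To make the association described in the comments concrete: within an invoice, a row is a sub-item of the row whose `Item#` equals its `MainCode`. A minimal sketch of that lookup, using only the `Inv_001` rows from the question (the `sub_items` name is just for illustration and not part of the original code):

import pandas as pd

# The Inv_001 rows from the question's table.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3,
    'Item#':     [260, 777, 888],
    'MainCode':  [0, 260, 260],
    'ProdTotal': [1500, 100, 120],
})

# Sub-items of main item 260: rows whose MainCode equals that Item#.
sub_items = df[(df['InvoiceNo'] == 'Inv_001') & (df['MainCode'] == 260)]

# Their ProdTotal (100 + 120) rolls up into item 260's 1500, giving 1720.
print(sub_items['ProdTotal'].sum())  # 220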

2 Answers


This works on the sample data. Does it work on your actual data?

import pandas as pd

# Sample data.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100]
})

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})

result = df[df['MainCode'].eq(0)]
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result['price'] = result.eval('ProdTotal / qty')
result = result.drop(columns=['ProdSubTotal'])

>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0

We first want to get the aggregate ProdTotal per InvoiceNo and MainCode (but only in the case where the MainCode is not equal to zero, .ne(0)):

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
>>> subtotals
  InvoiceNo  MainCode  ProdTotal
0   Inv_001       260        220
1   Inv_002       260        360

We then need to drop these sub-item rows from the main dataframe, so we keep only the rows where the MainCode equals zero, .eq(0).

result = df[df['MainCode'].eq(0)]
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0       1500
3   Inv_002      1160    260    700    1         0        700
7   Inv_002      1160    111    100    1         0        100

We want to join the subtotals to this result where the InvoiceNo matches and the Item# in result matches the MainCode in subtotals. One way to do this is to change the column names in subtotals and then perform a left merge:

subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal  ProdSubTotal
0   Inv_001      1720    260   1500    1         0       1500         220.0
1   Inv_002      1160    260    700    1         0        700         360.0
2   Inv_002      1160    111    100    1         0        100           NaN

Now we add the ProdSubTotal to the ProdTotal, treating missing subtotals as zero, and drop the helper column.

result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result = result.drop(columns=['ProdSubTotal'])
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0     1720.0
1   Inv_002      1160    260    700    1         0     1060.0
2   Inv_002      1160    111    100    1         0      100.0

Finally, we recalculate the price given the qty and new ProdTotal.

result['price'] = result.eval('ProdTotal / qty')
>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0
Alexander
  • It works well on the sample data, but not on the original data. Let me check my code further and I will get back to you. Thank you. – Sunni Aug 23 '19 at 17:40
  • Perfect, it's 3x faster now. Thanks a lot for suggesting avoiding `iterrows()` and for providing this solution. – Sunni Aug 23 '19 at 19:59

Do a pandas merge. Split the data into two DataFrames, one with InvoiceNo, totalamt, Item#, price, qty and another with InvoiceNo, MainCode. Then do an inner join using the merge operation, after which you can sum the values of the columns row-wise and drop the columns that are not required.
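
This answer describes the approach only in prose; a minimal sketch of what that split-and-merge might look like, reusing the question's `df` and a hypothetical `SubTotal` helper column (an illustration of the suggestion, not the answer author's code):

import pandas as pd

# Main items carry MainCode == 0; sub-items reference a main item's Item#.
mains = df[df['MainCode'].eq(0)].copy()

# Aggregate the sub-items per invoice and main code, renaming so the merge
# keys line up with the main items.
subs = (df[df['MainCode'].ne(0)]
        .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal']
        .sum()
        .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'SubTotal'}))

# Inner join as suggested, then sum row-wise and drop the helper column.
merged = mains.merge(subs, on=['InvoiceNo', 'Item#'], how='inner')
merged['ProdTotal'] = merged['ProdTotal'] + merged['SubTotal']
merged = merged.drop(columns=['SubTotal'])

Note that an inner join drops main items that have no sub-items (such as Item# 111 in Inv_002); the left merge with fillna(0) in the other answer keeps them.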

Parijat Bhatt