
I have a DataFrame like the one below:

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1500 |   1 |        0 |      1500 |
| Inv_001   |     1720 |   777 |   100 |   1 |      260 |       100 |
| Inv_001   |     1720 |   888 |   120 |   1 |      260 |       120 |
| Inv_002   |     1160 |   360 |   700 |   1 |        0 |       700 |
| Inv_002   |     1160 |   777 |   100 |   1 |      360 |       100 |
| Inv_002   |     1160 |   888 |   120 |   1 |      360 |       120 |
| Inv_002   |     1160 |   999 |   140 |   1 |      360 |       140 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

I want to add each row's ProdTotal to the row whose Item# equals that row's MainCode. Inspired by the answers I got to an earlier question of mine, I managed to produce the desired output below:

+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   |     1720 |   260 |  1720 |   1 |        0 |      1720 |
| Inv_002   |     1160 |   360 |  1060 |   1 |        0 |      1060 |
| Inv_002   |     1160 |   111 |   100 |   1 |        0 |       100 |
+-----------+----------+-------+-------+-----+----------+-----------+

using the code below:

import pandas as pd

df = pd.read_csv('data.csv')

# One sub-DataFrame per invoice.
df_grouped = dict(tuple(df.groupby(['InvoiceNo'])))

remove_index = []  # indices of sub-item rows to drop at the end
ids = 0

for x in df_grouped:
    for index, row in df_grouped[x].iterrows():
        ids += 1
        try:
            # Sub-items of the current row: rows whose MainCode equals this Item#.
            main_code_data = df_grouped[x].loc[df_grouped[x]['MainCode'] == row['Item#']]
            length = len(main_code_data['Item#'])
            iterator = 0
            index_value = 0
            # Accumulate the ProdTotal of the following rows into the current row.
            for i in range(len(df_grouped[x].index)):
                index_value += df_grouped[x].at[index + iterator, 'ProdTotal']
                df.at[index, 'ProdTotal'] = index_value

                iterator += 1

            # Remember the sub-item rows so they can be removed afterwards.
            for item in main_code_data.index:
                remove_index.append(item)

        except:  # missing index labels raise KeyError; errors are silently ignored
            pass

# Keep only the main items.
df = df.drop(remove_index)

But the data consists of millions of rows, and this code runs very slowly. From a brief Google search and comments from other members, I learned that `iterrows()` is what makes the code slow. How can I replace `iterrows()` to make my code more efficient and more Pythonic?

Sunni
  • Why is it not zero? None of the rows have main codes that are equal to the item numbers. Do you mean just adding them up in groups? – ifly6 Aug 23 '19 at 15:54
  • What should be zero? @ifly6 – new Q Open Wid Aug 23 '19 at 15:56
  • The `MainCode` is equal to the `Item#` within each `InvoiceNo`. E.g. in `Inv_001`, `Item# 777 888` have `MainCode 260`, so these items are part of `260`. – Sunni Aug 23 '19 at 15:58
  • @ifly6 I posted a revised example. This is not a simple groupby: items are added into `ProdTotal` only when they have an association to another code, which here is `260`. – Sunni Aug 23 '19 at 16:06
  • But in your example dataframe, none of the `MainCode` values are equal to `Item#`. – m13op22 Aug 23 '19 at 16:16
  • @HS-nebula Please refer to this [question](https://stackoverflow.com/questions/56317856/allocating-the-sub-items-to-the-main-items-in-pandas) of mine; it clarifies things further. – Sunni Aug 23 '19 at 16:18
  • Which column has the smallest number of unique values, `Item#` or `MainCode`? – m13op22 Aug 23 '19 at 16:20
  • @HS-nebula Column `Item#`. Think of this as the main product; the other items (e.g. in `Inv_001` `777, 888` and in `Inv_002` `777 888 999`) are the sub-products associated with it. – Sunni Aug 23 '19 at 16:24
  • @HS-nebula Is it clear? If not, please feel free to ask for more clarification. – Sunni Aug 23 '19 at 16:33
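
To make the association described in the comments concrete: within an invoice, a row is a sub-item of the row whose `Item#` equals its `MainCode`. A minimal sketch of that lookup, using only the `Inv_001` rows from the question (the `sub_items` name is just for illustration and not part of the original code):

import pandas as pd

# The Inv_001 rows from the question's table.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3,
    'Item#':     [260, 777, 888],
    'MainCode':  [0, 260, 260],
    'ProdTotal': [1500, 100, 120],
})

# Sub-items of main item 260: rows whose MainCode equals that Item#.
sub_items = df[(df['InvoiceNo'] == 'Inv_001') & (df['MainCode'] == 260)]

# Their ProdTotal (100 + 120) rolls up into item 260's 1500, giving 1720.
print(sub_items['ProdTotal'].sum())  # 220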

2 Answers


This works on the sample data. Does it work on your actual data?

import pandas as pd

# Sample data.
df = pd.DataFrame({
    'InvoiceNo': ['Inv_001'] * 3 + ['Inv_002'] * 5,
    'totalamt': [1720] * 3 + [1160] * 5,
    'Item#': [260, 777, 888, 260, 777, 888, 999, 111],
    'price': [1500, 100, 120, 700, 100, 120, 140, 100],
    'qty': [1] * 8,
    'MainCode': [0, 260, 260, 0, 260, 260, 260, 0],
    'ProdTotal': [1500, 100, 120, 700, 100, 120, 140, 100]
})

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})

result = df[df['MainCode'].eq(0)]
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result['price'] = result.eval('ProdTotal / qty')
result = result.drop(columns=['ProdSubTotal'])

>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0

We first want to get the aggregate ProdTotal per InvoiceNo and MainCode (but only in the case where the MainCode is not equal to zero, .ne(0)):

subtotals = df[df['MainCode'].ne(0)].groupby(
    ['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal'].sum()
>>> subtotals
  InvoiceNo  MainCode  ProdTotal
0   Inv_001       260        220
1   Inv_002       260        360

We then need to drop these sub-item rows from the main dataframe, so we keep only the rows where the MainCode equals zero, .eq(0).

result = df[df['MainCode'].eq(0)]
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0       1500
3   Inv_002      1160    260    700    1         0        700
7   Inv_002      1160    111    100    1         0        100

We want to join the subtotals to this result where the InvoiceNo matches and the Item# in result matches the MainCode in subtotals. One way to do this is to change the column names in subtotals and then perform a left merge:

subtotals = subtotals.rename(columns={'MainCode': 'Item#', 'ProdTotal': 'ProdSubTotal'})
result = result.merge(subtotals, on=['InvoiceNo', 'Item#'], how='left')
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal  ProdSubTotal
0   Inv_001      1720    260   1500    1         0       1500         220.0
1   Inv_002      1160    260    700    1         0        700         360.0
2   Inv_002      1160    111    100    1         0        100           NaN

Now we add the ProdSubTotal to the ProdTotal, treating missing subtotals as zero, and drop the helper column.

result['ProdTotal'] += result['ProdSubTotal'].fillna(0)
result = result.drop(columns=['ProdSubTotal'])
>>> result
  InvoiceNo  totalamt  Item#  price  qty  MainCode  ProdTotal
0   Inv_001      1720    260   1500    1         0     1720.0
1   Inv_002      1160    260    700    1         0     1060.0
2   Inv_002      1160    111    100    1         0      100.0

Finally, we recalculate the price given the qty and new ProdTotal.

result['price'] = result.eval('ProdTotal / qty')
>>> result
  InvoiceNo  totalamt  Item#   price  qty  MainCode  ProdTotal
0   Inv_001      1720    260  1720.0    1         0     1720.0
1   Inv_002      1160    260  1060.0    1         0     1060.0
2   Inv_002      1160    111   100.0    1         0      100.0
Alexander
  • It works well on the sample data, but not on the original data. Let me check my code further and I will get back to you. Thank you. – Sunni Aug 23 '19 at 17:40
  • Perfect, it's 3x faster now. Thanks a lot for suggesting avoiding `iterrows()` and for providing this solution. – Sunni Aug 23 '19 at 19:59

Do a pandas merge. Split the data into two DataFrames, one with InvoiceNo, totalamt, Item#, price, qty and another with InvoiceNo, MainCode. Then do an inner join using the merge operation, after which you can sum the values of the columns row-wise and drop the columns that are not required.
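
This answer describes the approach only in prose; a minimal sketch of what that split-and-merge might look like, reusing the question's `df` and a hypothetical `SubTotal` helper column (an illustration of the suggestion, not the answer author's code):

import pandas as pd

# Main items carry MainCode == 0; sub-items reference a main item's Item#.
mains = df[df['MainCode'].eq(0)].copy()

# Aggregate the sub-items per invoice and main code, renaming so the merge
# keys line up with the main items.
subs = (df[df['MainCode'].ne(0)]
        .groupby(['InvoiceNo', 'MainCode'], as_index=False)['ProdTotal']
        .sum()
        .rename(columns={'MainCode': 'Item#', 'ProdTotal': 'SubTotal'}))

# Inner join as suggested, then sum row-wise and drop the helper column.
merged = mains.merge(subs, on=['InvoiceNo', 'Item#'], how='inner')
merged['ProdTotal'] = merged['ProdTotal'] + merged['SubTotal']
merged = merged.drop(columns=['SubTotal'])

Note that an inner join drops main items that have no sub-items (such as Item# 111 in Inv_002); the left merge with fillna(0) in the other answer keeps them.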

Parijat Bhatt