I have a DataFrame like the one below:
+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   | 1720     | 260   | 1500  | 1   | 0        | 1500      |
| Inv_001   | 1720     | 777   | 100   | 1   | 260      | 100       |
| Inv_001   | 1720     | 888   | 120   | 1   | 260      | 120       |
| Inv_002   | 1160     | 360   | 700   | 1   | 0        | 700       |
| Inv_002   | 1160     | 777   | 100   | 1   | 360      | 100       |
| Inv_002   | 1160     | 888   | 120   | 1   | 360      | 120       |
| Inv_002   | 1160     | 999   | 140   | 1   | 360      | 140       |
| Inv_002   | 1160     | 111   | 100   | 1   | 0        | 100       |
+-----------+----------+-------+-------+-----+----------+-----------+
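For anyone who wants to reproduce this without the CSV file, the sample above can be built directly. This is only a sketch of the table as shown; the real data is read from a file and is much larger.

```python
import pandas as pd

# Sample rows copied from the table above (the real data comes from a CSV file).
df = pd.DataFrame({
    "InvoiceNo": ["Inv_001"] * 3 + ["Inv_002"] * 5,
    "totalamt":  [1720] * 3 + [1160] * 5,
    "Item#":     [260, 777, 888, 360, 777, 888, 999, 111],
    "price":     [1500, 100, 120, 700, 100, 120, 140, 100],
    "qty":       [1] * 8,
    "MainCode":  [0, 260, 260, 0, 360, 360, 360, 0],
    "ProdTotal": [1500, 100, 120, 700, 100, 120, 140, 100],
})
print(df)
```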
Within each invoice, I want to add the ProdTotal value of every row whose MainCode is equal to an Item# to the ProdTotal of the row with that Item#.
Inspired by the answers I got to my question, I managed to produce the desired output shown below:
+-----------+----------+-------+-------+-----+----------+-----------+
| InvoiceNo | totalamt | Item# | price | qty | MainCode | ProdTotal |
+-----------+----------+-------+-------+-----+----------+-----------+
| Inv_001   | 1720     | 260   | 1720  | 1   | 0        | 1720      |
| Inv_002   | 1160     | 360   | 1060  | 1   | 0        | 1060      |
| Inv_002   | 1160     | 111   | 100   | 1   | 0        | 100       |
+-----------+----------+-------+-------+-----+----------+-----------+
using the code below:
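import pandas as pd

df = pd.read_csv('data.csv')
# Split the frame into one sub-frame per invoice.
df_grouped = dict(tuple(df.groupby(['InvoiceNo'])))
remove_index = []
ids = 0
for x in df_grouped:
    for index, row in df_grouped[x].iterrows():
        ids += 1
        try:
            # Rows of this invoice whose MainCode points at the current item.
            main_code_data = df_grouped[x].loc[df_grouped[x]['MainCode'] == row['Item#']]
            length = len(main_code_data['Item#'])
            iterator = 0
            index_value = 0
            # Accumulate ProdTotal over consecutive index labels, starting at
            # this row; a label outside the invoice raises KeyError.
            for i in range(len(df_grouped[x].index)):
                index_value += df_grouped[x].at[index + iterator, 'ProdTotal']
                df.at[index, 'ProdTotal'] = index_value
                iterator += 1
            # Reached only if no KeyError occurred: mark the child rows for removal.
            for item in main_code_data.index:
                remove_index.append(item)
        except KeyError:
            pass
df = df.drop(remove_index)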
But the data consists of millions of rows, and this code runs very slowly. From a brief Google search and comments from other members, I learned that iterrows() is what makes the code slow. How can I replace iterrows() to make my code more efficient and more Pythonic?
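One vectorized direction, as a minimal sketch rather than a definitive solution: label every row with the main item it belongs to, then sum ProdTotal per invoice and main item with a grouped transform. It assumes that MainCode == 0 always marks a main item, that add-on rows reference a main item in the same invoice, and that add-ons are never nested. The names `is_main`, `parent`, `rolled` and `result` are illustrative only, and only the columns needed for the calculation are included.

```python
import pandas as pd

# Sample rows from the question, reduced to the columns the roll-up needs.
df = pd.DataFrame({
    "InvoiceNo": ["Inv_001"] * 3 + ["Inv_002"] * 5,
    "Item#":     [260, 777, 888, 360, 777, 888, 999, 111],
    "MainCode":  [0, 260, 260, 0, 360, 360, 360, 0],
    "ProdTotal": [1500, 100, 120, 700, 100, 120, 140, 100],
})

# Assumption: MainCode == 0 marks a main item; other rows are add-ons.
is_main = df["MainCode"].eq(0)

# Main items belong to themselves, add-ons belong to the item in MainCode.
parent = df["Item#"].where(is_main, df["MainCode"])

# Sum ProdTotal within each (invoice, main item) group, aligned to df's index.
rolled = df.groupby([df["InvoiceNo"], parent])["ProdTotal"].transform("sum")

# Keep only the main rows, carrying the rolled-up totals.
result = df.assign(ProdTotal=rolled)[is_main]
print(result)
```

This sketch only updates ProdTotal; if the price column should mirror the rolled-up total, as in the desired output, it would need to be assigned separately.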