I'm trying to insert a blank line into a dataframe whenever a value changes. However, my loop accumulates an extra step each time it runs, and I can't figure out why.
import numpy as np
import pandas as pd

# The blank row to insert
blank_row = [None, None]
# The first order number
first_order_number = df.iloc[0, 0]
# Loop over the df and, if the order number is new, insert the blank row.
for index, row in df.iterrows():
    if row['Purchase order'] != first_order_number:
        last_df = pd.DataFrame(np.insert(last_df.values, index, blank_row,
                                         axis=0), columns=last_df.columns)
        # Set the order variable to the new order number
        first_order_number = row['Purchase order']
Input data:
Purchase order Store
795571 4
795571 4
795562 5
795562 5
795562 5
795586 9
795586 9
795586 9
795588 10
795588 10
795588 10
795588 10
Expected output (a blank row wherever the purchase order changes):
Purchase order Store
795571 4
795571 4

795562 5
795562 5
795562 5

795586 9
795586 9
795586 9

795588 10
795588 10
795588 10
795588 10
Output:
Purchase order Store
795571 4
795571 4
795562 5
795562 5
795562 5
795586 9
795586 9
795586 9
795588 10
795588 10
795588 10
795588 10
My best guess is that when the variable first_order_number gets updated, it looks at the current index and inserts the value at ['Purchase order'], and this one-step offset then turns into two the next time around. To fix this, I changed the update to:
first_order_number = df.iloc[index + 1, 0]
This just put the counter out of sync and filled up with None-rows after store 5. How can I fix this?
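One possible explanation for the drift: `index` comes from `df`, but the insert happens in `last_df`, which grows by one row per insert, so each later position is off by the number of blank rows already added. Below is a minimal sketch of that idea, not a definitive fix. It assumes `last_df` starts as a copy of `df` cast to `object` (so that `None` can be inserted into the underlying array), and the names `current_order` and `inserted` are hypothetical helpers introduced here.

```python
import numpy as np
import pandas as pd

# Sample data copied from the input table in the question.
df = pd.DataFrame({
    "Purchase order": [795571, 795571, 795562, 795562, 795562, 795586,
                       795586, 795586, 795588, 795588, 795588, 795588],
    "Store": [4, 4, 5, 5, 5, 9, 9, 9, 10, 10, 10, 10],
})

blank_row = [None, None]
# Assumption: last_df begins as an object-dtype copy of df,
# so None values can be placed into its underlying array.
last_df = df.astype(object)
current_order = df.iloc[0, 0]
inserted = 0  # hypothetical counter: blank rows added so far

for index, row in df.iterrows():
    if row["Purchase order"] != current_order:
        # Shift the insert position by the rows already inserted,
        # because last_df is now longer than df.
        last_df = pd.DataFrame(
            np.insert(last_df.values, index + inserted, blank_row, axis=0),
            columns=last_df.columns,
        )
        inserted += 1
        current_order = row["Purchase order"]

print(last_df)
```

With the offset applied, the blank rows should land directly before the first row of each new purchase order rather than creeping upward.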
Maybe there is also a better way to achieve this since I know loops over DFs are slow.
Many thanks for all input
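On the question of avoiding the loop, one loop-free approach might be to compute where each original row should end up and then reindex, letting pandas fill the gaps with missing values. This is only a sketch under assumptions: the data is rebuilt from the input table above, the names `is_new`, `offset`, `positions`, and `result` are hypothetical, and the frame is cast to `object` so the order numbers are not converted to floats when missing rows appear.

```python
import numpy as np
import pandas as pd

# Sample data copied from the input table in the question.
df = pd.DataFrame({
    "Purchase order": [795571, 795571, 795562, 795562, 795562, 795586,
                       795586, 795586, 795588, 795588, 795588, 795588],
    "Store": [4, 4, 5, 5, 5, 9, 9, 9, 10, 10, 10, 10],
})

# True on the first row of every new purchase order.
is_new = df["Purchase order"].ne(df["Purchase order"].shift())
is_new.iloc[0] = False  # no blank row is wanted before the very first row

# Number of blank rows that must precede each original row.
offset = is_new.cumsum().to_numpy()

# Target position of each original row in the result.
positions = np.arange(len(df)) + offset

# Relabel the rows with their target positions, then reindex over the
# full range; positions with no row become all-missing (blank) rows.
result = df.astype(object)
result.index = positions
result = result.reindex(range(len(df) + int(offset[-1])))

print(result)
```

The blank rows show up as `NaN` here rather than `None`. If empty strings are preferred for display, `result.fillna("")` is one option.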