0

I need help making a cumulative sum that resets when a condition changes. I’ve used a loop but the dataset is too heavy. Also tried using both these links but I can’t get my desired output. I can't remove duplicates from the dataframe because I need information that's not shown in the example table.

Cumulative sum that resets based on the year

Pandas cumulative count

I have the columns License Plate and Transp Doc and need a new one called Deliveries, the condition would be: If it is the same License Plate AND the same Transp Doc it is considered 1 delivery, however if it's not the same Transp Doc add 1, and if it is not the same License Plate reset the cumulative sum. The excel formula (set on cell C3) on which I’m basing this approach would be =IF(A2=A3;IF(B2=B3;C2+0;C2+1);1)

A B C
1 License Plate Transp Doc Deliveries
2 AAA1111 65184 1
3 AAA1111 65186 2
4 AAA1111 65188 3
5 BBB2222 65195 1
6 BBB2222 65195 1
7 BBB2222 65201 2
8 CCC3333 65207 1
9 CCC3333 65207 1
10 DDD4444 65212 1
import pandas as pd


DF = pd.DataFrame({'License Plate': ["AAA1111","AAA1111","AAA1111","BBB2222","BBB2222","BBB2222","CCC3333","CCC3333","DDD4444"],
          'Transp Doc': [65184,65186,65188,65195,65195,65201,65207,65207,65212],
          'Deliveries': [1,2,3,1,1,2,1,1,1],
          })

The furthest I got was using this line:

DF['Deliveries'] = DF['License Plate'].eq(DF['License Plate'].shift()).cumsum()

Which just sums the if upper row is equal, without considering if Transp Doc is the same, and doesn’t reset when plate changes.

AcK
  • 2,063
  • 2
  • 20
  • 27
Gabriel_Koch
  • 401
  • 3
  • 14

1 Answers1

1

Find the duplicate rows (based on license plate and transp doc), that will result in a boolean series. Invert the series and you can add up the boolean values as integers 0 and 1 (but not yet).

Then split this series on changing license plate, by grouping by the license plate. Finally use the cumulative sum on those groups (adding the bools as ints), and assign the result to the new column in your dataframe.

import pandas as pd
df = pd.DataFrame({'License Plate': ["AAA1111","AAA1111","AAA1111","BBB2222","BBB2222","BBB2222","CCC3333","CCC3333","DDD4444"],
          'Transp Doc': [65184,65186,65188,65195,65195,65201,65207,65207,65212],
          'Deliveries': [1,2,3,1,1,2,1,1,1],
          })
df['Deliveries'] = (~df.duplicated(['License Plate', 'Transp Doc'])).groupby(df['License Plate']).cumsum()
print(df)
  License Plate  Transp Doc  Deliveries
0       AAA1111       65184           1
1       AAA1111       65186           2
2       AAA1111       65188           3
3       BBB2222       65195           1
4       BBB2222       65195           1
5       BBB2222       65201           2
6       CCC3333       65207           1
7       CCC3333       65207           1
8       DDD4444       65212           1
9769953
  • 10,344
  • 3
  • 26
  • 37