4

I am trying to build a cumulative sum column using pandas. However, this column must include the cumulative sum only if one column's value is greater than another column's value. Here's an example of my current data:

Index     A       B       C
0         1       20      3
1         10      15      11
2         20      12      25
3         30      18      32
4         40      32      17
5         50      12      4

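I want to cumsum() column A where column B is greater than column C; otherwise the value should be zero. The running total restarts after each run of zeros (row 4 is 40, not 51). The result column D in the original df should look like: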

Index     A       B       C      D
0         1       20      3      1
1         10      15      11     11
2         20      12      25     0
3         30      18      32     0
4         40      32      17     40
5         50      12      4      90

I appreciate any support in advance.

joseph
  • 67
  • 4
  • Which part are you having trouble with? [https://pandas.pydata.org/docs/user_guide/index.html](https://pandas.pydata.org/docs/user_guide/index.html) – wwii Sep 29 '20 at 21:15
  • I tried using `iterrows()` with no desired results – joseph Sep 29 '20 at 21:25

3 Answers

3

There is probably a more elegant solution, but this also works.

We first create two helper columns, x and x_shift.

df.x keeps the value of df.A where df.B > df.C and is 0 elsewhere.

df.x_shift is df.x shifted down by one row, with the resulting NaN filled with 0.

In the last step we add df.A and df.x_shift where the condition holds (0 elsewhere), then drop df.x and df.x_shift.

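import numpy as np
import pandas as pd

df['x'] = np.where(df.B > df.C, df.A, 0)
df['x_shift'] = df.x.shift(1).fillna(0)
# Adds only the previous row's value, so this matches a true cumulative sum only for runs of up to two rows
df['D'] = np.where(df.B > df.C, df.A + df.x_shift, 0)
df = df.drop(['x', 'x_shift'], axis=1)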
r_hudson
  • 193
  • 8
  • 1
    `x` and `x_shift` could be separate Series then you wouldn't need to drop them at the end. – wwii Sep 29 '20 at 22:34
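
The suggestion in the comment above could look roughly like the following sketch. The names `mask`, `x`, and `x_shift` are illustrative choices; the intermediates are kept as standalone Series so nothing has to be dropped afterwards. As in the answer, only the previous row is added, so this reproduces a cumulative sum only when a run of matching rows is at most two rows long.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 10, 20, 30, 40, 50],
                   'B': [20, 15, 12, 18, 32, 12],
                   'C': [3, 11, 25, 32, 17, 4]})

mask = df.B > df.C

# Standalone Series: never attached to df, so no cleanup is needed
x = df.A.where(mask, 0)
x_shift = x.shift(1, fill_value=0)

# Add the previous row's retained value where the condition holds, else 0
df['D'] = np.where(mask, df.A + x_shift, 0)
```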
3
df = pd.DataFrame({'A': {0: 1, 1: 10, 2: 20, 3: 30, 4: 40, 5: 50},
                   'B': {0: 20, 1: 15, 2: 12, 3: 18, 4: 32, 5: 12},
                   'C': {0: 3, 1: 11, 2: 25, 3: 32, 4: 17, 5: 4}})

Make a boolean Series for your condition and identify consecutive groups of True or False:

b_gt_c = df.B > df.C
groups = b_gt_c.ne(b_gt_c.shift()).cumsum()

In [107]: b_gt_c
Out[107]: 
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

In [108]: groups
Out[108]: 
0    1
1    1
2    2
3    2
4    3
5    3
dtype: int32

Group by those groups; multiply the cumsum of each group by the condition; assign the result to the new df column.

gb = df.groupby(groups)
for k,g in gb:
    df.loc[g.index,'D'] = g['A'].cumsum() * b_gt_c[g.index]

In [109]: df
Out[109]: 
    A   B   C     D
0   1  20   3   1.0
1  10  15  11  11.0
2  20  12  25   0.0
3  30  18  32   0.0
4  40  32  17  40.0
5  50  12   4  90.0

You could skip the for loop as well:

# np.nan instead of np.NaN: the capitalized alias was removed in NumPy 2.0
df['G'] = np.where(df.B.gt(df.C), df.A, np.nan)
group = df.B.gt(df.C).ne(df.B.gt(df.C).shift()).cumsum()
df['G'] = df.groupby(group).G.cumsum().fillna(0)

The technique for identifying consecutive occurrences of values comes from the SO Q&A "Grouping dataframe based on consecutive occurrence of values".
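
For reuse, the loop-free steps can be wrapped in a small helper. This is only a sketch: the function name `conditional_cumsum` and its parameter names are hypothetical, and it zeroes the non-matching rows up front with `where` instead of going through NaN, which keeps the result as integers.

```python
import pandas as pd

def conditional_cumsum(values: pd.Series, cond: pd.Series) -> pd.Series:
    """Cumulative sum of `values` within each consecutive run where
    `cond` is True; 0 wherever `cond` is False."""
    # A new run id starts every time the condition flips
    run_id = cond.ne(cond.shift()).cumsum()
    # Zero out rows that fail the condition, then cumsum within each run
    return values.where(cond, 0).groupby(run_id).cumsum()

df = pd.DataFrame({'A': [1, 10, 20, 30, 40, 50],
                   'B': [20, 15, 12, 18, 32, 12],
                   'C': [3, 11, 25, 32, 17, 4]})
df['D'] = conditional_cumsum(df.A, df.B > df.C)
```

Because the sum is taken per run, it also handles runs longer than two rows, for example values 1, 2, 3, 4 with conditions True, True, True, False.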

wwii
  • 23,232
  • 7
  • 37
  • 77
0

While it's a little barbaric, you could convert the columns to NumPy arrays and then write a simple loop that goes through the three arrays and compares values.
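
A minimal sketch of that idea, assuming the running total should reset whenever B is not greater than C (which is what the expected output in the question implies). The variable names `running` and `d` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 10, 20, 30, 40, 50],
                   'B': [20, 15, 12, 18, 32, 12],
                   'C': [3, 11, 25, 32, 17, 4]})

a = df['A'].to_numpy()
b = df['B'].to_numpy()
c = df['C'].to_numpy()

d = np.zeros_like(a)  # result defaults to 0
running = 0           # running total for the current run
for i in range(len(a)):
    if b[i] > c[i]:
        running += a[i]
        d[i] = running
    else:
        running = 0   # condition failed: restart the total

df['D'] = d
```

A plain Python loop like this is easy to read but will be slower than the groupby approaches on large frames.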