2

I have a df where one column contains continuous data and a second column contains categorical data. For example:

df = {"data": [0, 1, 3, 2, 6, 0, 9, 5, 1, 3], 
    "category": ["A","A","A","B","A","A","A","A","B","A"]} 

df = pd.DataFrame(df)

In []: df
Out[]: data    category    
0       0         A
1       1         A
2       3         A
3       2         B
4       6         A
5       0         A
6       9         A
7       5         A
8       1         B
9       3         A

I want to create a new column ['new_col'] where the value of each row is the sum of 'data' from that same row all the way down to the row above where the 'category' changes. Important to note is that the number of rows before the category changes is not consistent throughout the df. For example, once the computations were performed, the above df would result in:

In []: df
Out[]: data    category  new_col  
0       0         A         4
1       1         A         4
2       3         A         3
3       2         B         22
4       6         A         20
5       0         A         14
6       9         A         14
7       5         A         5
8       1         B         4
9       3         A         3

I have found a lot of answers about summing all values of rows in a column. But, I can't quite figure out how to iterate over all of the rows and perform the type of calculation described above. I feel like it should be an easy answer, but I haven't had any luck so far.

I started by even trying to figure out how to add one row and the row immediately below it only if the category is the same. But, I'm having no luck. And, I'm relatively new to all this so I'm sure I'm not even close to being on the right track. E.g.:

for index, row in df.iterrows():
    empty_list = []
    if row['category'] == A:
        if row.category.shift(-1) == A:
            num_add = row.data + row.data.shift(-1)
            empty_list.append(num_add)

data_set = pd.concat([data_set, empty_list], axis=1, join='inner')
davidcox
  • 23
  • 5
  • Why does the sum not change on row 8 ? – Erfan Jul 04 '19 at 19:59
  • Good question! Because the category changed, it's now summing only rows 8 and 9. I see how that makes the example confusing though. I just edited the example to remove that confusion. – davidcox Jul 04 '19 at 20:02
  • Seems like you are summing from the bottom up. Row 9 + Row 8 = 4, which is the sum in row 8. But whydoes is not reset in row 8 since the category changed from A to B. But it does reset in row 2. – Erfan Jul 04 '19 at 20:13
  • I see what you’re saying. I wasn’t clear with how the values are being summed. In the data set this represents, a category label of ‘B’ technically means the first row of a new condition. So all rows between one ‘B’ & the row above the next ‘B’ are from the same condition. And, I’m trying to compute the described operation for each condition. So, rows 0-2 are one condition, rows 3-7 are one condition, rows 8-9 are a third condition Some of the conditions repeat at later points in time in the df. But, I don’t want to sum across all conditions of the same type. Only the ones that are contiguous. – davidcox Jul 04 '19 at 20:20

1 Answers1

1

You can try this:

df['new_col'] = (df.groupby(df['category'].eq('B').cumsum())
                   ['data'].apply(lambda x: (x[::-1].cumsum())[::-1] )
                )

which gives:

   data category  new_col
0     0        A        4
1     1        A        4
2     3        A        3
3     2        B       22
4     6        A       20
5     0        A       14
6     9        A       14
7     5        A        5
8     1        B        4
9     3        A        3
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thanks Quang Hoang!! That did it! – davidcox Jul 04 '19 at 20:39
  • Hi @quang-hoang. This is a nice solution. I have a problem close to it but little different, would you please take a look: https://stackoverflow.com/questions/67453646/create-new-dataframe-using-multiple-conditions-across-different-timeline-and-loc – Roy May 09 '21 at 02:51