I have a df where one column contains continuous data and a second column contains categorical data. For example:
df = {"data": [0, 1, 3, 2, 6, 0, 9, 5, 1, 3],
"category": ["A","A","A","B","A","A","A","A","B","A"]}
df = pd.DataFrame(df)
In []: df
Out[]: data category
0 0 A
1 1 A
2 3 A
3 2 B
4 6 A
5 0 A
6 9 A
7 5 A
8 1 B
9 3 A
I want to create a new column ['new_col'] where the value of each row is the sum of 'data' from that same row all the way down to the row above where the 'category' changes. Important to note is that the number of rows before the category changes is not consistent throughout the df. For example, once the computations were performed, the above df would result in:
In []: df
Out[]: data category new_col
0 0 A 4
1 1 A 4
2 3 A 3
3 2 B 22
4 6 A 20
5 0 A 14
6 9 A 14
7 5 A 5
8 1 B 4
9 3 A 3
I have found a lot of answers about summing all values of rows in a column. But, I can't quite figure out how to iterate over all of the rows and perform the type of calculation described above. I feel like it should be an easy answer, but I haven't had any luck so far.
I started by even trying to figure out how to add one row and the row immediately below it only if the category is the same. But, I'm having no luck. And, I'm relatively new to all this so I'm sure I'm not even close to being on the right track. E.g.:
for index, row in df.iterrows():
empty_list = []
if row['category'] == A:
if row.category.shift(-1) == A:
num_add = row.data + row.data.shift(-1)
empty_list.append(num_add)
data_set = pd.concat([data_set, empty_list], axis=1, join='inner')