I have a pandas dataframe like the following with columns "type" and "val0".

type    val0
   0       0
   0       0
   0       0
   1       1
   1       3
   0       0
   0       0
   0       0
   0       0
   1       3
   0       0
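
To reproduce the example, the frame above could be built as follows. This is a minimal sketch that assumes plain integer columns and the conventional variable name df:

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "type": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    "val0": [0, 0, 0, 1, 3, 0, 0, 0, 0, 3, 0],
})
```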

The column "type" can be 0 or 1, and val0 has interesting values only for type == 1.

I am trying to create two columns val1 and val2. They are defined in a similar way:

  • For val1, I want to find each run of one or more consecutive rows of type 1, take the sum of val0 over that run, and fill val1 with that sum from the row after the previous run of type 1 through the last row of the current run.
  • For val2, I want to find each run of one or more consecutive rows of type 1, take the minimum of val0 over that run, and fill val2 with that minimum from the first row of the current run up to, but not including, the first row of the next run of type 1.

Using 0 as the default value for val1 and val2, I would get this output for the above dataframe:

type    val0    val1    val2
   0       0       4       0
   0       0       4       0
   0       0       4       0
   1       1       4       1
   1       3       4       1
   0       0       3       1
   0       0       3       1
   0       0       3       1
   0       0       3       1
   1       3       3       3
   0       0       0       3

Since type == 1 for rows 4, 5, and 10 (numbering rows from 1),

  • val1 is 1+3 = 4 for rows 1-5 and 3 for rows 6-10
  • val2 is min(1,3) = 1 for rows 4-9 and 3 for rows 10-11

What is an efficient way to do this? My actual dataframe has a few million rows and I would like to create many columns like val1 and val2 that use different aggregates of type 1 rows.

user79913

1 Answer

You can use the shift-cumsum pattern (see "python pandas - creating a column which keeps a running count of consecutive values") to label each run of consecutive equal values in type, then aggregate within each run:

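import numpy as np

# Give each run of identical consecutive 'type' values its own group id.
grouper = df.groupby(df['type'].ne(df['type'].shift()).cumsum())
# Broadcast each run's sum and min of val0 back onto its rows.
df['val1'] = grouper['val0'].transform('sum')
df['val2'] = grouper['val0'].transform('min')
# Keep the aggregates only on type == 1 rows, then fill the gaps.
df.loc[df['type'].eq(0), ['val1', 'val2']] = np.nan
df['val1'] = df['val1'].bfill().fillna(0)  # backward: rows before each run
df['val2'] = df['val2'].ffill().fillna(0)  # forward: rows after each run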
>>> df
    type  val0  val1  val2
0      0     0   4.0   0.0
1      0     0   4.0   0.0
2      0     0   4.0   0.0
3      1     1   4.0   1.0
4      1     3   4.0   1.0
5      0     0   3.0   1.0
6      0     0   3.0   1.0
7      0     0   3.0   1.0
8      0     0   3.0   1.0
9      1     3   3.0   3.0
10     0     0   0.0   3.0
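
Since the question asks for many such columns with different aggregates, the same steps could be wrapped in a small helper. This is only a sketch: fill_run_aggregate and its parameters are hypothetical names, not part of pandas, and it masks the type == 0 rows with Series.where rather than the .loc assignment used above.

```python
import pandas as pd

def fill_run_aggregate(df, agg, direction, default=0):
    """Aggregate val0 over each run of type == 1 and spread the result.

    direction="backward" fills the rows before each run (like val1);
    direction="forward" fills the rows after each run (like val2).
    Hypothetical helper for illustration only.
    """
    # Run id: increments whenever 'type' differs from the previous row.
    run_id = df["type"].ne(df["type"].shift()).cumsum()
    # Per-run aggregate of val0, broadcast back to every row of the run.
    per_row = df.groupby(run_id)["val0"].transform(agg)
    # Keep the aggregate on type == 1 rows only; other rows become NaN.
    masked = per_row.where(df["type"].eq(1))
    filled = masked.bfill() if direction == "backward" else masked.ffill()
    return filled.fillna(default)

# Sample data from the question.
df = pd.DataFrame({
    "type": [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],
    "val0": [0, 0, 0, 1, 3, 0, 0, 0, 0, 3, 0],
})

df["val1"] = fill_run_aggregate(df, "sum", "backward")
df["val2"] = fill_run_aggregate(df, "min", "forward")
# Any other aggregate name accepted by transform works the same way.
df["val3"] = fill_run_aggregate(df, "max", "forward")
```

For a large frame with many output columns, the run id and the type == 1 mask could be computed once outside the helper and passed in, so they are not rebuilt for every column.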
Alexander