2

I want to drop rows with zero in the "value" column up until the index of the first non-zero value for each group.

Input

df = pd.DataFrame({'date': ['2019-01-01', '2019-01-02', '2019-01-03','2019-01-04',
                           '2019-01-01', '2019-01-02', '2019-01-03','2019-01-04',
                            '2019-01-01', '2019-01-02', '2019-01-03','2019-01-04'],
                  'value' : [0, 50, 100, 120, 0, 10, 0, 20, 0, 0, 10, 0],
                  'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
                 })
df
    date         value     group
0   2019-01-01   0         A
1   2019-01-02   50        A
2   2019-01-03   100       A
3   2019-01-04   120       A
4   2019-01-01   0         B
5   2019-01-02   10        B
6   2019-01-03   0         B
7   2019-01-04   20        B
8   2019-01-01   0         C
9   2019-01-02   0         C
10  2019-01-03   10        C
11  2019-01-04   0         C

Output

    date         value     group
1   2019-01-02   50        A
2   2019-01-03   100       A
3   2019-01-04   120       A
5   2019-01-02   10        B
6   2019-01-03   0         B
7   2019-01-04   20        B
10  2019-01-03   10        C
11  2019-01-04   0         C

Similar to Find first non-zero value in each column of pandas DataFrame, but dropping the data up until index and grouped.

Nick D
  • 430
  • 1
  • 4
  • 11

1 Answers1

6

Use groupby and cumsum, compare the result to zero:

df[df.groupby('group')['value'].cumsum().gt(0)]

          date  value group
1   2019-01-02     50     A
2   2019-01-03    100     A
3   2019-01-04    120     A
5   2019-01-02     10     B
6   2019-01-03      0     B
7   2019-01-04     20     B
10  2019-01-03     10     C
11  2019-01-04      0     C

This is a slightly more generic version which should work for removing rows with any leading value,

thresh = 0
df[df['value'].ne(thresh).groupby(df['group']).cumsum().gt(0)]

          date  value group
1   2019-01-02     50     A
2   2019-01-03    100     A
3   2019-01-04    120     A
5   2019-01-02     10     B
6   2019-01-03      0     B
7   2019-01-04     20     B
10  2019-01-03     10     C
11  2019-01-04      0     C
cs95
  • 379,657
  • 97
  • 704
  • 746
  • 3
    good approach, might not work if series contains negative values. – Quang Hoang Jun 12 '19 at 18:04
  • @QuangHoang Right, OP would probably want to clarify the expected behaviour in that case (if they're to be dropped, then the solution is easy. If they are to be kept, then it'll be a bit harder but doable.) – cs95 Jun 12 '19 at 18:11
  • 1
    The comment was actually for your first solution. 2nd solution would work just fine. – Quang Hoang Jun 12 '19 at 18:12
  • 1
    @cs95 Thank you for your solution! Ideally I'd like to drop leading rows with an arbitrary threshold, e.g. the first value above 20, or the first value below -10 – Nick D Jun 12 '19 at 20:34
  • 1
    @NickD My second solution will work. Please take a look (I've edited for clarity). – cs95 Jun 12 '19 at 20:36
  • Your text implies you could have more then one leading zero row, it's useful to include an example of this to test it. – smci Nov 30 '21 at 15:47
  • @QuangHoang, you could just use `df[df.assign(value=df["value"].apply(lambda x: abs(x))).groupby('group')['value'].cumsum() != 0]`. The usage of absolute values is required as it could happen that negative and positive values will add up to 0 with `cumsum` – Stoney Jan 25 '22 at 10:34