0

I have a dataframe with time-series data as follows:

      Date      Value
0  2021-12-01     A
1  2021-12-02     A
2  2021-12-03     A
3  2021-12-04     B
4  2021-12-05     B
5  2021-12-06     A
6  2021-12-07     A
7  2021-12-08     C

I'm trying to reduce this to only have the first date of each continuous block for the Value column. So the result would look like:

      Date      Value
0  2021-12-01     A
1  2021-12-04     B
2  2021-12-06     A
3  2021-12-08     C

I've tried a bunch of different ways of masking, dropping duplicates based on the mask, etc. but cannot do it. Any help is appreciated!

siki
  • 9,077
  • 3
  • 27
  • 36

1 Answers1

0

You can use ne (not equals) + shift​ to create a mask where the first value of each consecutive group is True, and then cumsum to create a unique for each group that's shared by all its items.

Then, drop_duplicates based on that, and use index of the returned rows to index the dataframe:

subset = df.loc[df['Value'].ne(df['Value'].shift(1)).cumsum().drop_duplicates().index]

Output:

>>> subset
         Date Value
0  2021-12-01     A
3  2021-12-04     B
5  2021-12-06     A
7  2021-12-08     C
  • Awesome, thank you! I had the first part (`ne + shift`) and the last part (`drop_duplicates`) but was missing the `cumsum`. – siki Dec 22 '21 at 23:35
  • Yeah, cumsum is useful for WAY more than just getting the cumsum of a numeric column and putting it somewhere... ;) –  Dec 22 '21 at 23:36