3

My problems is the same as, How to group by continuous records in SQL, only I need a solution in Pandas.

Given a df like

ID  Colour
------------
 1   Red
 2   Red
 3   Red
 4   Red
 5   Red
 6   Green
 7   Green
 8   Green
 9   Green
10   Red
11   Red
12   Red
13   Red
14   Green
15   Green
16   Green
17   Blue
18   Blue
19   Red
20   Blue

I want it grouped into

color  minId
------------
Red     1
Green   6
Red    10
Green  14
Blue   17
Red    19
Blue   20

It is okay to change the name of the colors (e.g., Green1)

The solution should generalize into other aggregations other than just min

Olsgaard
  • 1,006
  • 9
  • 19
  • I think not correct dupe, so reopened. – jezrael Feb 10 '22 at 09:13
  • @jezrael how is it not the same as [dupe](https://stackoverflow.com/questions/40802800/how-to-groupby-consecutive-values-in-pandas-dataframe)? `(df.a != df.a.shift()).cumsum()` – mozway Feb 10 '22 at 09:16
  • @mozway - aggregation min and first? – jezrael Feb 10 '22 at 09:17
  • @mozway - part dupe, agree, total dupe, not – jezrael Feb 10 '22 at 09:17
  • Given the title I think the aggregation doesn't really matter here, it's just an example. Anyway do as you want, but please if you reopen leave the duplicate link. It's a bit annoying to search for existing answer to have this work just dropped from a single click. ;) – mozway Feb 10 '22 at 09:18

1 Answers1

4

You can grouping by consecutive values by helper Series created by compared shifted values and cumsum and then aggregate first and min:

g = df['Colour'].ne(df['Colour'].shift()).cumsum()
df = df.groupby(g).agg(color=('Colour','first'), minId=('ID','min')).reset_index(drop=True)
print (df)
   color  minId
0    Red      1
1  Green      6
2    Red     10
3  Green     14
4   Blue     17
5    Red     19
6   Blue     20
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 2
    Works great. For people who prefer chained statements, this can be reworked into `df.assign(g = lambda x: x.Colour.ne(x.Colour.shift()).cumsum()).groupby('g').agg(color=('Colour','first'), minId=('ID','min'))` – Olsgaard Feb 10 '22 at 09:39
  • Hello everyone! In case I have multiple columns, what's the approach? – LaSanton Sep 16 '22 at 15:43