calculate max run length of groups of consecutive values in pandas

Question

Given a dataset like this:

values = ([ 'motorway' ] * 5) + ([ 'link' ] * 3) + ([ 'motorway' ] * 7)

df = pd.DataFrame.from_dict({
  'timestamp': pd.date_range(start='2018-1-1', end='2018-1-2', freq='s').tolist()[:len(values)],
  'road_type': values,
})
df.set_index('timestamp')
df['delta_t'] = (df['timestamp'] - df['timestamp'].shift()).fillna(0)

I want to have the max sums of delta_t per group of consecutive road_types; given that delta_t is going to be 1s in this sample case, I want to find motorway: 7s and link: 3s. In practice there will be more road_types, and delta_t will vary.

edit: the solution provided here seems similar but does not sum the times nor does it select the largest of each groups.

Possible duplicate of [Pandas DataFrame: How to groupby consecutive values](https://stackoverflow.com/questions/40802800/pandas-dataframe-how-to-groupby-consecutive-values) — Niklas Mertsch, Oct 20 '18 at 15:49

Peter Leimbigler · Accepted Answer · 2018-10-20T17:56:05.110

1

Create a new column that labels each "run" of identical road types with a unique integer, then group by that column and sum:

df['run'] = (df['road_type'] != df['road_type'].shift()).astype(int).cumsum()

df
             timestamp road_type  delta_t  run
0  2018-01-01 00:00:00  motorway 00:00:00    1
1  2018-01-01 00:00:01  motorway 00:00:01    1
2  2018-01-01 00:00:02  motorway 00:00:01    1
3  2018-01-01 00:00:03  motorway 00:00:01    1
4  2018-01-01 00:00:04  motorway 00:00:01    1
5  2018-01-01 00:00:05      link 00:00:01    2
6  2018-01-01 00:00:06      link 00:00:01    2
7  2018-01-01 00:00:07      link 00:00:01    2
8  2018-01-01 00:00:08  motorway 00:00:01    3
9  2018-01-01 00:00:09  motorway 00:00:01    3
10 2018-01-01 00:00:10  motorway 00:00:01    3
11 2018-01-01 00:00:11  motorway 00:00:01    3
12 2018-01-01 00:00:12  motorway 00:00:01    3
13 2018-01-01 00:00:13  motorway 00:00:01    3
14 2018-01-01 00:00:14  motorway 00:00:01    3


df.groupby('run').agg({'road_type': 'first', 'delta_t': 'sum'}).reset_index(drop=True).groupby('road_type').max()

           delta_t
road_type         
link      00:00:03
motorway  00:00:07

edited Oct 20 '18 at 17:56

answered Oct 20 '18 at 15:56

Peter Leimbigler

10,775
1
23
37

Of these I only want the last two rows though. – retorquere Oct 20 '18 at 17:43
1

wait -- df.groupby('run').agg({'road_type': 'first', 'delta_t': 'sum'}).reset_index(drop=True).groupby('road_type').max() does the job! Thanks! – retorquere Oct 20 '18 at 17:50
@retorquere, sorry, I didn't read your question carefully enough! I'll edit your fix into the answer :) – Peter Leimbigler Oct 20 '18 at 17:55
No worries, this has helped me a lot. What does the "'road_type': 'first'" do? – retorquere Oct 20 '18 at 18:04
That aggregates the `road_type` column by taking the first value of each group, but all the members of each group are the same, so '`first`' could be replaced with any aggfunc that works on non-numeric data types. – Peter Leimbigler Oct 20 '18 at 20:08

calculate max run length of groups of consecutive values in pandas

1 Answers1