
I have a pandas DataFrame with a datetime index and multiple columns of 1s and 0s. I want to clump those 1s and 0s together to figure out how many occur in a sequence. I then want to determine the minimum, mean, and maximum time the clumps lasted in each column, using the datetime index. Only the 1s need to be counted, since in essence they are the "on" values in this case. If a clump consists of a single 1, the time from that 1 to the next row is its timeframe.

So far I have seen the first part done this way in this post: Counting cons values and adding them

y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
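
For a concrete idea of what that line does, here is a small self-contained example (the data is made up):

```python
import pandas as pd

# example on/off column; the values here are illustrative
y = pd.Series([1, 1, 0, 1, 0, 1, 1, 1])

# (y != y.shift()).cumsum() labels each run of equal values;
# cumcount() + 1 numbers the positions within each run, and
# multiplying by y zeroes out the runs of 0s
runs = y * (y.groupby((y != y.shift()).cumsum()).cumcount() + 1)
print(runs.tolist())  # → [1, 2, 0, 1, 0, 1, 2, 3]
```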

The index has the format "year-month-day hour:minute:second".

An example of the Dataframe is as follows.

                                   col1     col2
datetime
2021-05-24 00:09:22                    1       0 
2021-05-24 00:09:24                    1       0  
2021-05-24 00:09:25                    0       1 
2021-05-24 00:09:26                    1       0 
2021-05-24 00:09:27                    0       0

With a desired output like this for each column, in seconds or minutes. The output has to use the datetime index rather than just counting the consecutive values and multiplying by a fixed time step, since the datetime spacing is not consistent.

      col1  col2
min      1     1
max      3     1
mean     2     1
Xderic
    Could you explain your expected output? I understand why max is 2 seconds, but how do you calculate the time for a block that consists of only one row? And do you consider runs of 0s as well as runs of 1s? Maybe you could expand the example dataframe a bit to make this more clear, and also give the expected output for col2. – Arne Jul 17 '21 at 14:21
  • Just added additional information, hope that's enough to fix the lack of information. – Xderic Jul 17 '21 at 14:37

2 Answers


You could try this:

def count_secs(ser):
    # duration of a stretch: last timestamp minus first, plus one second
    return (ser.index[-1] - ser.index[0]).seconds + 1

def min_max_mean(col):
    if 1 not in col.values:
        return 0, 0, 0
    # label each stretch of consecutive 1s with its own group number
    groups = (col != col.shift(1))[col.eq(1)].cumsum()
    counts = groups.groupby(groups.values).apply(count_secs)
    return counts.min(), counts.max(), counts.mean()

df = df.apply(min_max_mean, axis='index')
df.index = ['min', 'max', 'mean']
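
The trickiest line is the group labelling. On the example col1 it behaves like this (a standalone illustration rebuilding just that column):

```python
import pandas as pd

col = pd.Series(
    [1, 1, 0, 1, 0],
    index=pd.to_datetime([
        '2021-05-24 00:09:22', '2021-05-24 00:09:24',
        '2021-05-24 00:09:25', '2021-05-24 00:09:26',
        '2021-05-24 00:09:27',
    ]),
)

# True wherever the value changes, restricted to the rows that are 1,
# then cumulatively summed: each stretch of 1s gets its own label
groups = (col != col.shift(1))[col.eq(1)].cumsum()
print(groups.tolist())  # → [1, 1, 2]
```

So the first two rows form stretch 1 and the row at 00:09:26 forms stretch 2; the rows with 0 are dropped entirely.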

Result for df

                     col1  col2
datetime                       
2021-05-24 00:09:22     1     0
2021-05-24 00:09:24     1     0
2021-05-24 00:09:25     0     1
2021-05-24 00:09:26     1     0
2021-05-24 00:09:27     0     0

is

      col1  col2
min    1.0   1.0
max    3.0   1.0
mean   2.0   1.0

EDIT: My interpretation of your question was that you want to account for seconds that are missing from the index (your example output suggests that this is what you are looking for). If that's not the case (see @mozway's answer), a simpler version without the count_secs function suffices:

def min_max_mean(col):
    if 1 not in col.values:
        return 0, 0, 0
    groups = (col != col.shift(1))[col.eq(1)].cumsum()
    # count rows per stretch instead of measuring elapsed seconds
    counts = groups.groupby(groups.values).count()
    return counts.min(), counts.max(), counts.mean()

df = df.apply(min_max_mean, axis='index')
df.index = ['min', 'max', 'mean']

Result:

      col1  col2
min    1.0   1.0
max    2.0   1.0
mean   1.5   1.0
Timus

The exact operations that you want are not fully clear to me, but if I understand correctly that you want stats on the duration of stretches of 1s, you can use the following approach:

First, make sure that the index uses the datetime type:

df.index = pd.to_datetime(df.index)

I then calculate several identifiers:

  1. first element of each stretch (first_stretch)
  2. last element of each stretch (last_stretch)
  3. groups of stretches (stretch_group)
  4. the time difference in seconds from the first value (timedelta)
  5. the time difference in seconds between consecutive rows (timediff)
  6. the cumulated time in seconds within each stretch (cum_diff)

df['first_stretch'] = df['col1'].astype(bool) & df['col1'].shift(1).fillna(0).eq(0)
df['last_stretch']  = (df['col1'] - df['col1'].shift(-1)).eq(1)
df['stretch_group'] = df['first_stretch'].cumsum().mask(~df['col1'].astype(bool))
df['timedelta'] = (df.index - df.index[0]).total_seconds().astype(int)
df['timediff'] = df['timedelta'].diff(1).fillna(0).astype(int)
df['cum_diff'] = (df.groupby('stretch_group')['timediff'].cumsum()
                  * df['col1']).fillna(0).astype(int)

                     col1  col2  first_stretch  last_stretch  stretch_group  timedelta  timediff  cum_diff
datetime                                                                                                  
2021-05-24 00:09:22     1     0           True         False            1.0          0         0         0
2021-05-24 00:09:24     1     0          False          True            1.0          2         2         2
2021-05-24 00:09:25     0     1          False         False            NaN          3         1         0
2021-05-24 00:09:26     1     0           True          True            2.0          4         1         1
2021-05-24 00:09:27     0     0          False         False            NaN          5         1         0

Maybe you don't need all those, but this makes it easy to do various calculations. For example, if you want to get the min/max/mean duration of each stretch:

>>> df[df.last_stretch]['timediff'].agg(['mean', 'min', 'max'])
mean    1.5
min     1.0
max     2.0

The values are not strictly identical to your expected output, so please clarify the exact bounds of your time intervals.
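
If the question's expected output (min 1, max 3, mean 2 for col1) is taken literally, one possible reading is that a stretch's duration runs from its first row to its last row, plus one second. Here is a standalone sketch under that assumption; it rebuilds the example frame from scratch, and the names stretch and secs are just illustrative:

```python
import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame(
    {'col1': [1, 1, 0, 1, 0], 'col2': [0, 0, 1, 0, 0]},
    index=pd.to_datetime([
        '2021-05-24 00:09:22', '2021-05-24 00:09:24',
        '2021-05-24 00:09:25', '2021-05-24 00:09:26',
        '2021-05-24 00:09:27',
    ]),
)

col = df['col1']
# label each stretch of 1s; rows with 0 become NaN and are
# dropped by groupby's default dropna=True
stretch = (col != col.shift()).cumsum().where(col.eq(1))
# duration per stretch: last timestamp minus first, plus one second
secs = stretch.groupby(stretch).apply(
    lambda g: (g.index[-1] - g.index[0]).seconds + 1
)
stats = secs.agg(['min', 'max', 'mean'])
print(stats['min'], stats['max'], stats['mean'])
```

For col1 this yields min 1, max 3, mean 2, matching the expected output above.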

mozway