Group streaks of ones (with a twist)

Question

I have read related questions like this one and this blog post.

Unfortunately I am unable to modify the solutions to my needs.

Consider a Series with a DatetimeIndex which may look like this:

Code to instantiate an example:

s = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0], index=pd.date_range(start=0, freq='1d', periods=12, name='A')

Ultimately, I want to get the result

  (t4 - t1)
+ (t8 - t5)
+ (t10 - t8)

This means I need to identify streaks of 1 padded with 0 on each side. I can do everything after that myself, i.e. grouping by streak (possibly with cumcount) and diffing the first and last timestamp in each group.

There are some special cases when the Series starts/ends with a 1.In this case I want to treat it as if it was preceded/followed by a 0 at the same timestamp, e.g.

Attempt so far:

I'm going to concat some sub-solutions for easier visualization.

Pad the series with a zero on each end, to avoid special cases.

s = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0], index=pd.date_range(start=0, freq='1d', periods=12, name='A')
s = pd.concat([pd.Series([0], index=s.index[:1]), s, pd.Series([0], index=s.index[-1:])])

Get the last 0 before and the first 0 after a streak of ones.

 >>> tmp = pd.concat([s, s.diff(-1).eq(-1).astype(int).rename('starter'), s.diff(1).eq(-1).astype(int).rename('ender')], axis=1)
 >>> tmp 
             A  starter  ender
 1970-01-01  0        0      0
 1970-01-02  0        1      0
 1970-01-03  1        0      0
 1970-01-04  1        0      0
 1970-01-05  0        0      1
 1970-01-06  0        1      0
 1970-01-07  1        0      0
 1970-01-08  1        0      0
 1970-01-09  0        1      1
 1970-01-10  1        0      0
 1970-01-11  0        0      1
 1970-01-12  0        0      0

Fill single zero gaps in the 'A' column with 1 because they don't change the desired result. (This step might not be necessary but helps the visualization.)

 >>> tmp.loc[(both := tmp['starter'].eq(1) & tmp['ender'].eq(1)), 'A'] = 1
 >>> tmp
             A  starter  ender
 1970-01-01  0        0      0
 1970-01-02  0        1      0
 1970-01-03  1        0      0
 1970-01-04  1        0      0
 1970-01-05  0        0      1
 1970-01-06  0        1      0
 1970-01-07  1        0      0
 1970-01-08  1        0      0
 1970-01-09  1        1      1
 1970-01-10  1        0      0
 1970-01-11  0        0      1
 1970-01-12  0        0      0

Adjust the 'starter' and 'ender' columns.

 >>> tmp.loc[both, ['starter', 'ender']] = 0
 >>> tmp 
             A  starter  ender
 1970-01-01  0        0      0
 1970-01-02  0        1      0
 1970-01-03  1        0      0
 1970-01-04  1        0      0
 1970-01-05  0        0      1
 1970-01-06  0        1      0
 1970-01-07  1        0      0
 1970-01-08  1        0      0
 1970-01-09  1        0      0
 1970-01-10  1        0      0
 1970-01-11  0        0      1
 1970-01-12  0        0      0

And this is where I'm stuck.

jacob · Answer 1 · 2021-06-11T14:50:17.080

Here is a solution to your problem:

>>> s = pd.Series([0, 1, 1, 0, 1, 0], index=pd.date_range(start=0, freq='1d', periods=6))

1970-01-01    0
1970-01-02    1
1970-01-03    1
1970-01-04    0
1970-01-05    1
1970-01-06    0

>>> s2 = s.diff(1)

1970-01-01    NaN
1970-01-02    1.0
1970-01-03    0.0
1970-01-04   -1.0
1970-01-05    1.0
1970-01-06   -1.0

>>> s3 = s2.loc[s == 0].reset_index(name="d0")

    index       d0
0   1970-01-01  NaN
1   1970-01-04  -1.0
2   1970-01-06  -1.0

>>> s4 = s3["index"].diff(1).loc[s3["d0"] == -1]

1   3 days
2   2 days

>>> s4.sum().days
5

Putting it all together,

s.diff(1) \
    .loc[s == 0] \
    .reset_index(name="val_diff") \
    .assign(date_diff=lambda x: x["index"].diff(1)) \
    .loc[lambda x: x.val_diff == -1] \
    .date_diff.sum()

For the special cases, add extra rows to the top and bottom of the dataframe if they start or end with value = 1. Something like:

if s.iloc[0] == 1:
    s = pd.concat([
            pd.Series([0], index=s.index[:1]),
            s
    ])

score 0 · Accepted Answer · answered Jul 25 '21 at 08:13

0

I was almost done in the attempt from the question, here's the completion:

result = (s.index[s.diff(1).eq(-1)] - s.index[s.diff(-1).eq(-1)]).sum()

Assumes s starts and ends with a zero. Otherwise apply padding with zeros first.

answered Jul 25 '21 at 08:13

actual_panda

1,178
9
27

Group streaks of ones (with a twist)

2 Answers2