0

I have read related questions like this one and this blog post.

Unfortunately I am unable to modify the solutions to my needs.

Consider a Series with a DatetimeIndex which may look like this:

enter image description here

Code to instantiate an example:

s = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0], index=pd.date_range(start=0, freq='1d', periods=12, name='A')

Ultimately, I want to get the result

  (t4 - t1)
+ (t8 - t5)
+ (t10 - t8)

This means I need to identify streaks of 1 padded with 0 on each side. I can do everything after that myself, i.e. grouping by streak (possibly with cumcount) and diffing the first and last timestamp in each group.

There are some special cases when the Series starts/ends with a 1.In this case I want to treat it as if it was preceded/followed by a 0 at the same timestamp, e.g.

enter image description here


Attempt so far:

I'm going to concat some sub-solutions for easier visualization.

  1. Pad the series with a zero on each end, to avoid special cases.

    s = pd.Series([0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0], index=pd.date_range(start=0, freq='1d', periods=12, name='A')
    s = pd.concat([pd.Series([0], index=s.index[:1]), s, pd.Series([0], index=s.index[-1:])])
    
  2. Get the last 0 before and the first 0 after a streak of ones.

     >>> tmp = pd.concat([s, s.diff(-1).eq(-1).astype(int).rename('starter'), s.diff(1).eq(-1).astype(int).rename('ender')], axis=1)
     >>> tmp 
                 A  starter  ender
     1970-01-01  0        0      0
     1970-01-02  0        1      0
     1970-01-03  1        0      0
     1970-01-04  1        0      0
     1970-01-05  0        0      1
     1970-01-06  0        1      0
     1970-01-07  1        0      0
     1970-01-08  1        0      0
     1970-01-09  0        1      1
     1970-01-10  1        0      0
     1970-01-11  0        0      1
     1970-01-12  0        0      0
    
  3. Fill single zero gaps in the 'A' column with 1 because they don't change the desired result. (This step might not be necessary but helps the visualization.)

     >>> tmp.loc[(both := tmp['starter'].eq(1) & tmp['ender'].eq(1)), 'A'] = 1
     >>> tmp
                 A  starter  ender
     1970-01-01  0        0      0
     1970-01-02  0        1      0
     1970-01-03  1        0      0
     1970-01-04  1        0      0
     1970-01-05  0        0      1
     1970-01-06  0        1      0
     1970-01-07  1        0      0
     1970-01-08  1        0      0
     1970-01-09  1        1      1
     1970-01-10  1        0      0
     1970-01-11  0        0      1
     1970-01-12  0        0      0
    
  4. Adjust the 'starter' and 'ender' columns.

     >>> tmp.loc[both, ['starter', 'ender']] = 0
     >>> tmp 
                 A  starter  ender
     1970-01-01  0        0      0
     1970-01-02  0        1      0
     1970-01-03  1        0      0
     1970-01-04  1        0      0
     1970-01-05  0        0      1
     1970-01-06  0        1      0
     1970-01-07  1        0      0
     1970-01-08  1        0      0
     1970-01-09  1        0      0
     1970-01-10  1        0      0
     1970-01-11  0        0      1
     1970-01-12  0        0      0
    

And this is where I'm stuck.

actual_panda
  • 1,178
  • 9
  • 27

2 Answers2

1

Here is a solution to your problem:

>>> s = pd.Series([0, 1, 1, 0, 1, 0], index=pd.date_range(start=0, freq='1d', periods=6))

1970-01-01    0
1970-01-02    1
1970-01-03    1
1970-01-04    0
1970-01-05    1
1970-01-06    0
>>> s2 = s.diff(1)

1970-01-01    NaN
1970-01-02    1.0
1970-01-03    0.0
1970-01-04   -1.0
1970-01-05    1.0
1970-01-06   -1.0
>>> s3 = s2.loc[s == 0].reset_index(name="d0")

    index       d0
0   1970-01-01  NaN
1   1970-01-04  -1.0
2   1970-01-06  -1.0
>>> s4 = s3["index"].diff(1).loc[s3["d0"] == -1]

1   3 days
2   2 days

>>> s4.sum().days
5

Putting it all together,

s.diff(1) \
    .loc[s == 0] \
    .reset_index(name="val_diff") \
    .assign(date_diff=lambda x: x["index"].diff(1)) \
    .loc[lambda x: x.val_diff == -1] \
    .date_diff.sum()

For the special cases, add extra rows to the top and bottom of the dataframe if they start or end with value = 1. Something like:

if s.iloc[0] == 1:
    s = pd.concat([
            pd.Series([0], index=s.index[:1]),
            s
    ])
jacob
  • 160
  • 8
0

I was almost done in the attempt from the question, here's the completion:

result = (s.index[s.diff(1).eq(-1)] - s.index[s.diff(-1).eq(-1)]).sum()

Assumes s starts and ends with a zero. Otherwise apply padding with zeros first.

actual_panda
  • 1,178
  • 9
  • 27