2

Fully reproducible dataframe and previous attempts at the end of the question. A similare question has been asked and answered for R


I've got a dataframe with a timestamp dates and values a that are integers in the range [-10, 10]. In the column named above I've identified periods of time where a > 0. Now I'd like to retrieve all beginnings and ends af such periods.

Data

    dates       a   above   pch  per
0   2020-01-01  0   0       0.0  0.0
1   2020-01-02  -7  0       0.0  0.0
2   2020-01-03  -9  0       0.0  1.0
3   2020-01-06  1   1       1.0  0.0
4   2020-01-07  1   1       0.0  0.0
5   2020-01-08  2   1       0.0  0.0
6   2020-01-09  1   1       0.0  0.0
7   2020-01-10  1   1       0.0  -1.0
8   2020-01-13  -2  0       -1.0  1.0
9   2020-01-14  8   1       1.0  0.0
10  2020-01-15  10  1       0.0  0.0

I have been able to do so using an extremely cumbersome, and possibly unstable, way using for loops and identifying period changes in the column pch. To make it arguably easier to index the rows in for loops, I've shifted the same column into a new column named per. And running the snippet below will give med the desired output, which is:

[[Timestamp('2020-01-06 00:00:00'), Timestamp('2020-01-10 00:00:00')],
 [Timestamp('2020-01-14 00:00:00'), Timestamp('2020-01-15 00:00:00')]]

But as you will see, the procedure is far from elegant. So it would be fantastic if any of you pandas pros would have any other suggestions on how to do this.

Complete code:

import pandas as pd
from pandas import Timestamp


df = pd.DataFrame({'dates': {0: Timestamp('2020-01-01 00:00:00'),
          1: Timestamp('2020-01-02 00:00:00'),
          2: Timestamp('2020-01-03 00:00:00'),
          3: Timestamp('2020-01-06 00:00:00'),
          4: Timestamp('2020-01-07 00:00:00'),
          5: Timestamp('2020-01-08 00:00:00'),
          6: Timestamp('2020-01-09 00:00:00'),
          7: Timestamp('2020-01-10 00:00:00'),
          8: Timestamp('2020-01-13 00:00:00'),
          9: Timestamp('2020-01-14 00:00:00'),
          10: Timestamp('2020-01-15 00:00:00')},
         'a': {0: 0, 1: -7, 2: -9, 3: 1, 4: 1, 5: 2, 6: 1, 7: 1, 8: -2, 9: 8, 10: 10},
         'above': {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 0, 9: 1, 10: 1},
         'pch': {0: 0.0,
          1: 0.0,
          2: 0.0,
          3: 1.0,
          4: 0.0,
          5: 0.0,
          6: 0.0,
          7: 0.0,
          8: -1.0,
          9: 1.0,
          10: 0.0},
         'per': {0: 0.0,
          1: 0.0,
          2: 1.0,
          3: 0.0,
          4: 0.0,
          5: 0.0,
          6: 0.0,
          7: -1.0,
          8: 1.0,
          9: 0.0,
          10: 0.0}})

# extract beriod starts and ends

# containers
p_s = []
p_e = []

# find beginning of periods where
# previous row for df['per']  = 1, and
# current row for df['per']  =  -1 
for i, p in enumerate(df['a'][1:], 1):
    #print(df['a'].iat[i-1])
    if df['per'].iat[i-1]==1:
        #print(df['dates'].iat[i])
        p_s.append(df['dates'].iat[i])
    if df['per'].iat[i]==-1:
        p_e.append(df['dates'].iat[i])

# every period should have a beginning and and end.
# so if there are more starts than ends, the last
# date available is appended to p_3
if len(p_e) < len(p_s):
    p_e.append(df['dates'].iat[-1])

# transform a list of starts and a list of ends
# into as list of starts and ends
p_corrected = []
for i, p in enumerate(p_s):
    #print(p_s[i])
    new_elem = [p_s[i], p_e[i]]
    p_corrected.append(new_elem)

print(p_corrected)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
vestland
  • 55,229
  • 37
  • 187
  • 305

1 Answers1

3

You can create consecutive groups by cumulative sum with inverted mask and pass to filtered rows only for greater values with aggregate first and last values per groups:

m = df['a'].gt(0)

df1 = df[m].groupby((~m).cumsum())['dates'].agg(['first','last'])
print (df1)
       first       last
a                      
3 2020-01-06 2020-01-10
4 2020-01-14 2020-01-15

L = df1.apply(list, axis=1).tolist()
print (L)
[[Timestamp('2020-01-06 00:00:00'), Timestamp('2020-01-10 00:00:00')], 
 [Timestamp('2020-01-14 00:00:00'), Timestamp('2020-01-15 00:00:00')]]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252