
I have the following dataframe:

     A    B    C
0    1    1    1
1    0    1    0
2    1    1    1
3    1    0    1
4    1    1    0
5    1    1    0 
6    0    1    1
7    0    1    0

For each column I want to know the start and end index of every stretch where the value is 1 for 3 or more consecutive rows. Desired outcome:

Column    From    To
     A       2     5
     B       0     2
     B       4     7
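
For reference, here is a minimal snippet to reconstruct the example dataframe above:

import pandas as pd

# reconstruct the example dataframe shown above
df = pd.DataFrame({
    'A': [1, 0, 1, 1, 1, 1, 0, 0],
    'B': [1, 1, 1, 0, 1, 1, 1, 1],
    'C': [1, 0, 1, 1, 0, 0, 1, 0],
})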

First I filter out the 1s that are not part of a run of 3 or more consecutive values:

filtered_df = df.copy().apply(filter, threshold=3)

where

def filter(col, threshold=3):
    # flag values belonging to runs shorter than the threshold
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    # only the short runs of 1s should be zeroed out
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1, 0))
    return col

filtered_df now looks like this:

     A    B    C
0    0    1    0
1    0    1    0
2    1    1    0
3    1    0    0
4    1    1    0
5    1    1    0 
6    0    1    0
7    0    1    0

If the dataframe had only one column of zeros and ones, the result could be achieved as in How to use pandas to find consecutive same data in time series. However, I am struggling to do something similar for multiple columns at once.
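
For a single column, something along these lines works for me (my own rough sketch of the groupby-on-cumsum idea from that answer, shown here for column A):

# single-column sketch: start and end index of each run of 1s with length >= 3
s = df['A']
grp = (s != s.shift()).cumsum()          # label consecutive runs of equal values
mask = s.eq(1)
runs = (s.index.to_series()[mask]
          .groupby(grp[mask])
          .agg(['first', 'last', 'size']))
print(runs[runs['size'] >= 3])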

Hoekieee
  • Perhaps package your code in a function and then apply the function to the dataframe as a whole? You'd of course need to extend the `filter` function to apply it for each col in df.columns. – Vash Oct 21 '19 at 10:20

2 Answers


Use DataFrame.pipe to apply a function to the whole DataFrame.

In the first solution, get the first and last index of each run of consecutive 1s per column, append the output to a list and finally concat:

import pandas as pd

def f(df, threshold=3):
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        # label consecutive runs and keep only the positions holding a 1
        g = (df[col] != df[col].shift()).cumsum()[m]
        # keep runs whose length is at least the threshold
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        # first and last original index per remaining run
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)

    return pd.concat(out, ignore_index=True)

Or first reshape with unstack and then apply the same idea to the reshaped data:

def f(df, threshold=3):
    # reshape to long format: one row per (column, original index) pair
    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    # label consecutive runs per column
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    # keep only 1s belonging to runs of at least threshold length
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first','last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
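
For orientation, df1 inside the second solution is the long format with one row per (column, original index) pair; running the unstack line on its own shows:

df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
print (df1.head())
  col  idx  val
0   A    0    1
1   A    1    0
2   A    2    1
3   A    3    1
4   A    4    1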


filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3
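
If the exact column names from the question (Column/From/To) are wanted, one option is to simply rename the result afterwards:

filtered_df = df.pipe(f, threshold=3).rename(
    columns={'col': 'Column', 'first': 'From', 'last': 'To'})
print (filtered_df)
  Column  From  To
0      A     2   5
1      B     0   2
2      B     4   7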
jezrael

You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:

length = 3
window = df.rolling(length)
# a window qualifies only if every value in it is 1
mask = (window.min() == 1) & (window.max() == 1)
# rolling labels the window end, so shift the result back to the window start
mask = mask.shift(1 - length)
print(mask)

which prints:

       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN
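
The True values mark the starting rows of the all-ones windows; to get From/To ranges like in the question, one possible follow-up (my own sketch, not part of the answer above, reusing mask and length from before) is:

import pandas as pd

# collapse consecutive window starts per column into (From, To) ranges
mask = mask.fillna(False).astype(bool)
out = []
for col in mask.columns:
    m = mask[col]
    grp = (m != m.shift()).cumsum()          # label consecutive blocks of equal values
    starts = m.index.to_series()[m].groupby(grp[m]).agg(['first', 'last'])
    for _, row in starts.iterrows():
        out.append({'Column': col,
                    'From': row['first'],
                    'To': row['last'] + length - 1})  # last window start + window size - 1
print(pd.DataFrame(out))

which reproduces the A and B ranges from the desired outcome.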
a_guest