
I have the following dataframe:

     A    B    C
0    1    1    1
1    0    1    0
2    1    1    1
3    1    0    1
4    1    1    0
5    1    1    0 
6    0    1    1
7    0    1    0

For each column I want to know the start and end index of every stretch where the value is 1 for 3 or more consecutive rows. Desired outcome:

Column    From    To
     A       2     5
     B       0     2
     B       4     7
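
For reference, here is a minimal snippet to reconstruct the example dataframe above:

import pandas as pd

# reconstruct the example dataframe shown above
df = pd.DataFrame({
    'A': [1, 0, 1, 1, 1, 1, 0, 0],
    'B': [1, 1, 1, 0, 1, 1, 1, 1],
    'C': [1, 0, 1, 1, 0, 0, 1, 0],
})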

First I filter out the 1s that are not part of a run of 3 or more consecutive values:

filtered_df = df.copy().apply(filter, threshold=3)

where

def filter(col, threshold=3):
    # flag values belonging to runs shorter than the threshold
    mask = col.groupby((col != col.shift()).cumsum()).transform('count').lt(threshold)
    # only the short runs of 1s should be zeroed out
    mask &= col.eq(1)
    col.update(col.loc[mask].replace(1, 0))
    return col

filtered_df now looks like this:

     A    B    C
0    0    1    0
1    0    1    0
2    1    1    0
3    1    0    0
4    1    1    0
5    1    1    0 
6    0    1    0
7    0    1    0

If the dataframe had only one column of zeros and ones, the result could be achieved as in How to use pandas to find consecutive same data in time series. However, I am struggling to do something similar for multiple columns at once.
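
For a single column, something along these lines works for me (my own rough sketch of the groupby-on-cumsum idea from that answer, shown here for column A):

# single-column sketch: start and end index of each run of 1s with length >= 3
s = df['A']
grp = (s != s.shift()).cumsum()          # label consecutive runs of equal values
mask = s.eq(1)
runs = (s.index.to_series()[mask]
          .groupby(grp[mask])
          .agg(['first', 'last', 'size']))
print(runs[runs['size'] >= 3])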

Hoekieee
  • Perhaps package your code in a function and then apply the function to the dataframe as a whole? You'd of course need to extend the `filter` function to apply it for each col in df.columns. – Vash Oct 21 '19 at 10:20

2 Answers


Use DataFrame.pipe to apply a function to the whole DataFrame.

In the first solution, get the first and last index of each run of consecutive 1s per column, append the output to a list and finally concat:

import pandas as pd

def f(df, threshold=3):
    out = []
    for col in df.columns:
        m = df[col].eq(1)
        # label consecutive runs and keep only the positions holding a 1
        g = (df[col] != df[col].shift()).cumsum()[m]
        # keep runs whose length is at least the threshold
        mask = g.groupby(g).transform('count').ge(threshold)
        filt = g[mask].reset_index()
        # first and last original index per remaining run
        output = filt.groupby(col)['index'].agg(['first','last'])
        output.insert(0, 'col', col)
        out.append(output)

    return pd.concat(out, ignore_index=True)

Or first reshape with unstack and then apply the same idea to the reshaped data:

def f(df, threshold=3):
    # reshape to long format: one row per (column, original index) pair
    df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
    m = df1['val'].eq(1)
    # label consecutive runs per column
    g = (df1['val'] != df1.groupby('col')['val'].shift()).cumsum()
    # keep only 1s belonging to runs of at least threshold length
    mask = g.groupby(g).transform('count').ge(threshold) & m
    return (df1[mask].groupby([df1['col'], g])['idx']
                     .agg(['first','last'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
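
For orientation, df1 inside the second solution is the long format with one row per (column, original index) pair; running the unstack line on its own shows:

df1 = df.unstack().rename_axis(('col','idx')).reset_index(name='val')
print (df1.head())
  col  idx  val
0   A    0    1
1   A    1    0
2   A    2    1
3   A    3    1
4   A    4    1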


filtered_df = df.pipe(f, threshold=3)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
filtered_df = df.pipe(f, threshold=2)
print (filtered_df)
  col  first  last
0   A      2     5
1   B      0     2
2   B      4     7
3   C      2     3
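
If the exact column names from the question (Column/From/To) are wanted, one option is to simply rename the result afterwards:

filtered_df = df.pipe(f, threshold=3).rename(
    columns={'col': 'Column', 'first': 'From', 'last': 'To'})
print (filtered_df)
  Column  From  To
0      A     2   5
1      B     0   2
2      B     4   7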
jezrael

You can use rolling to create a window over the data frame. Then you can apply all your conditions and shift the window back to its start location:

length = 3
window = df.rolling(length)
# a window qualifies only if every value in it is 1
mask = (window.min() == 1) & (window.max() == 1)
# rolling labels the window end, so shift the result back to the window start
mask = mask.shift(1 - length)
print(mask)

which prints:

       A      B      C
0  False   True  False
1  False  False  False
2   True  False  False
3   True  False  False
4  False   True  False
5  False   True  False
6    NaN    NaN    NaN
7    NaN    NaN    NaN
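
The True values mark the starting rows of the all-ones windows; to get From/To ranges like in the question, one possible follow-up (my own sketch, not part of the answer above, reusing mask and length from before) is:

import pandas as pd

# collapse consecutive window starts per column into (From, To) ranges
mask = mask.fillna(False).astype(bool)
out = []
for col in mask.columns:
    m = mask[col]
    grp = (m != m.shift()).cumsum()          # label consecutive blocks of equal values
    starts = m.index.to_series()[m].groupby(grp[m]).agg(['first', 'last'])
    for _, row in starts.iterrows():
        out.append({'Column': col,
                    'From': row['first'],
                    'To': row['last'] + length - 1})  # last window start + window size - 1
print(pd.DataFrame(out))

which reproduces the A and B ranges from the desired outcome.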
a_guest