16

I have a df like so:

Count
1
0
1
1
0
0
1
1
1
0

and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count and a 0 if there is not. So in the new column each row would get a 1 based on this criteria being met in the column Count. My desired output would then be:

Count  New_Value
1      0 
0      0
1      1
1      1
0      0
0      0
1      1
1      1 
1      1
0      0

I am thinking I may need to use itertools but I have been reading about it and haven't come across what I need yet. I would like to be able to use this method to count any number of consecutive occurrences, not just 2 as well. For example, sometimes I need to count 10 consecutive occurrences, I just use 2 in the example here.

Georgy
  • 12,464
  • 7
  • 65
  • 73
Stefano Potter
  • 3,467
  • 10
  • 45
  • 82
  • Check if `df['Count'][1] == df['Count'][1].shift(1)`, and if so, `1`, else `0`. Then you should `.append()` these values (0 or 1) to an `array`. Then set the first element(`array[0]`) to `0` (default). Then you have to figure out how to `merge/join/plug/concatenate` your `array` into your `dataframe`. 100% untested, but I think this may work... :) – dot.Py Jun 21 '16 at 02:02
  • I may have simplified my question too much, what if I want 3 consecutive occurrences though? I don't think this works then – Stefano Potter Jun 21 '16 at 02:07

2 Answers2

20

You could:

df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count

to get:

   Count  consecutive
0      1            1
1      0            0
2      1            2
3      1            2
4      0            0
5      0            0
6      1            3
7      1            3
8      1            3
9      0            0

From here you can, for any threshold:

threshold = 2
df['consecutive'] = (df.consecutive > threshold).astype(int)

to get:

   Count  consecutive
0      1            0
1      0            0
2      1            1
3      1            1
4      0            0
5      0            0
6      1            1
7      1            1
8      1            1
9      0            0

or, in a single step:

(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)

In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:

 df = pd.concat([df for _ in range(1000)])

%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop

compared to:

%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size    
pd.Series(l)

10 loops, best of 3: 76.7 ms per loop
Stefan
  • 41,759
  • 13
  • 76
  • 81
  • 3
    Here is a one-liner: `df.assign(consecutive=df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size')).query('consecutive > @threshold')` which will work for any consecutive values (not only ones and zeros) – MaxU - stand with Ukraine Jun 27 '16 at 20:41
2

Not sure if this is optimized, but you can give it a try:

from itertools import groupby
import pandas as pd

l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size

df['new_Value'] = pd.Series(l)

df

Count   new_Value
0   1   0
1   0   0
2   1   1
3   1   1
4   0   0
5   0   0
6   1   1
7   1   1
8   1   1
9   0   0
Psidom
  • 209,562
  • 33
  • 339
  • 356