15

Let's say I have a pd.Series like the one below:

s = pd.Series([False, True, False, True, True, True, False, False])

0    False
1     True
2    False
3     True
4     True
5     True
6    False
7    False
dtype: bool

I want to know the length of the longest run of consecutive True values. In this example, it is 3.

I tried it the naive way:

s_list = s.tolist()
count = 0
max_count = 0
for item in s_list:
    if item:
        count +=1
    else:
        if count>max_count:
            max_count = count
        count = 0
print(max_count)

This prints 3, but for a Series that is all True it prints 0.

Dawei

6 Answers

27

Option 1
Use the series itself to mask the cumulative sum of its negation, then use value_counts.

(~s).cumsum()[s].value_counts().max()

3

explanation

  1. (~s).cumsum() is a pretty standard way to produce distinct True/False groups

    0    1
    1    1
    2    2
    3    2
    4    2
    5    2
    6    3
    7    4
    dtype: int64
    
  2. But you can see that the group we care about is represented by the 2s and there are four of them. That's because the group is initiated by the first False (which becomes True with (~s)). Therefore, we mask this cumulative sum with the boolean mask we started with.

    (~s).cumsum()[s]
    
    1    1
    3    2
    4    2
    5    2
    dtype: int64
    
  3. Now we see the three 2s pop out and we just have to use a method to extract them. I used value_counts and max.
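As a rough sketch, the three steps above can be run one at a time on the example Series. The variable names (`labels`, `true_labels`, `run_lengths`) are only illustrative and are not part of the original one-liner.

```python
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])

# Step 1: every False bumps the counter, so each run of True shares one label.
labels = (~s).cumsum()

# Step 2: keep only the positions where s is True.
true_labels = labels[s]

# Step 3: count how often each label occurs; the largest count is the longest run.
run_lengths = true_labels.value_counts()
longest = run_lengths.max()

print(labels.tolist())       # [1, 1, 2, 2, 2, 2, 3, 4]
print(true_labels.tolist())  # [1, 2, 2, 2]
print(longest)               # 3
```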


Option 2
Use factorize and bincount

a = s.values
b = pd.factorize((~a).cumsum())[0]
np.bincount(b[a]).max()

3

explanation
The explanation is similar to the one for option 1. The main difference is in how I found the max. I use pd.factorize to tokenize the values into integers ranging from 0 to the number of unique values minus one. Given the actual values we had in (~a).cumsum(), we didn't strictly need this part. I used it because it's a general-purpose tool that could be used on arbitrary group names.

After pd.factorize I use those integer values in np.bincount which accumulates the total number of times each integer is used. Then take the maximum.
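To make the intermediate values visible, here is a minimal sketch of the same factorize/bincount pipeline broken into steps. The names `group_ids`, `codes`, `uniques` and `counts` are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([False, True, False, True, True, True, False, False])
a = s.values

# Group labels: each False starts a new group.
group_ids = (~a).cumsum()                 # [1 1 2 2 2 2 3 4]

# factorize maps the labels to consecutive integer codes starting at 0.
codes, uniques = pd.factorize(group_ids)  # codes: [0 0 1 1 1 1 2 3]

# Keep the codes at True positions and count occurrences of each code.
counts = np.bincount(codes[a])            # [1 3]

print(counts.max())                       # 3
```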


Option 3
As stated in the explanation of option 2, this also works:

a = s.values
np.bincount((~a).cumsum()[a]).max()

3
pault
piRSquared
5

I think this could work

pd.Series(s.index[~s].values).diff().max()-1
Out[57]: 3.0

Also, outside pandas, we can fall back on Python's itertools.groupby:

from itertools import groupby
max([len(list(group)) for key, group in groupby(s.tolist())])
Out[73]: 3

Update:

from itertools import compress
max(list(compress([len(list(group)) for key, group in groupby(s.tolist())],[key for key, group in groupby(s.tolist())])))
Out[84]: 3
BENY
2

Edit: As piRSquared mentioned, my previous solution needs a False appended at both the beginning and the end of the series. piRSquared kindly gave an answer based on that:

(np.diff(np.flatnonzero(np.append(True, np.append(~s.values, True)))) - 1).max()

My original attempt was:

(np.diff(s.where(~s).dropna().index.values) - 1).max()

(This does not give the correct answer if the longest True run starts at the beginning or ends at the end of the series, as piRSquared pointed out. Please use piRSquared's solution above. The original is kept only for the explanation.)

Explanation:

This finds the indices of the False parts and by finding the gaps between the indices of False, we can know the longest True.

  • s.where(s == False).dropna().index.values finds all the indices of False

    array([0, 2, 6, 7])
    

  • We know that the Trues live between the Falses, so we can use np.diff to find the gaps between these indices.

    array([2, 4, 1])

  • Subtract 1 at the end, since the Trues lie strictly between these indices.

  • Take the maximum of the differences.
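The same gap idea, with the sentinel padding suggested by piRSquared, could be sketched as below. The helper name `longest_true_run` is hypothetical, and `np.concatenate` is used here in place of the nested `np.append` calls. Treat it as an illustration rather than a definitive implementation.

```python
import numpy as np
import pandas as pd

def longest_true_run(s):
    """Hypothetical helper: longest run of True, via gaps between False positions."""
    # Pad both ends with a "False" sentinel (True in the negated array),
    # so runs touching either edge of the series are bounded too.
    is_false = np.concatenate(([True], ~s.to_numpy(dtype=bool), [True]))
    false_positions = np.flatnonzero(is_false)
    # Number of True values strictly between consecutive False positions.
    gaps = np.diff(false_positions) - 1
    return int(gaps.max())

s = pd.Series([False, True, False, True, True, True, False, False])
print(longest_true_run(s))                              # 3
print(longest_true_run(pd.Series([True, True, True])))  # 3
```

Because of the two sentinels, there are always at least two "False" positions, so the helper also returns 0 for an all-False or empty Series.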

Tai
  • 1
    Umm nice solution – BENY Feb 21 '18 at 02:46
  • 1
    Agreed this is nice. However, if you have the longest `True` sequence at the beginning or the end of the array, your diff will not catch it. You need to append `False` to the ends, then do it. Also, you don't need `s == False`, `~s` will do. – piRSquared Feb 21 '18 at 02:53
  • 1
    This is how I would have done it. Feel free to add it to your answer as it is the same concept, only if you want to (-: `(np.diff(np.flatnonzero(np.append(True, np.append(~s.values, True)))) - 1).max()` Though I'd suggest formatting nicer. – piRSquared Feb 21 '18 at 02:56
  • 1
    @piRSquared thank you for offering the solution to this. I appreciate it. – Tai Feb 21 '18 at 03:01
2

You can use this (inspired by @piRSquared's answer):

s.groupby((~s).cumsum()).sum().max()
Out[513]: 3.0

Another option is to use a lambda function:

s.to_frame().apply(lambda x: s.loc[x.name:].idxmin() - x.name, axis=1).max()
Out[429]: 3
Allen Qin
2

Your code was actually very close. It becomes perfect with a minor fix:

count = 0
maxCount = 0
for item in s:
    if item:
        count += 1
        if count > maxCount:
            maxCount = count
    else:
        count = 0
print(maxCount)
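The original loop only updated the maximum in the else branch, so a run that reaches the end of the series was never compared. A small sketch wrapping the fixed loop in a function makes the edge cases easy to check. The function name `longest_true_run_loop` is made up for this example.

```python
import pandas as pd

def longest_true_run_loop(values):
    """Hypothetical wrapper around the corrected loop."""
    count = 0
    max_count = 0
    for item in values:
        if item:
            count += 1
            # Update on every True, so a run ending at the last element counts.
            max_count = max(max_count, count)
        else:
            count = 0
    return max_count

s = pd.Series([False, True, False, True, True, True, False, False])
print(longest_true_run_loop(s))                              # 3
print(longest_true_run_loop(pd.Series([True, True, True])))  # 3
print(longest_true_run_loop(pd.Series([False, True, True]))) # 2
```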
FatihAkici
1

I'm not exactly sure how to do it with pandas but what about using itertools.groupby?

>>> from itertools import groupby
>>> import pandas as pd
>>> s = pd.Series([False, True, False, True, True, True, False, False])
>>> max(sum(1 for _ in g) for k, g in groupby(s) if k)
3
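One caveat with the groupby one-liner: if the series contains no True at all, the generator is empty and `max()` raises a ValueError. A possible variation, assuming 0 is the answer you want in that case, passes `default=0`. The function name `longest_true_run_groupby` is hypothetical.

```python
from itertools import groupby

import pandas as pd

def longest_true_run_groupby(s):
    """Hypothetical helper: longest run of True using itertools.groupby."""
    # groupby yields (key, run) pairs for consecutive equal values;
    # keep only the True runs and measure each one.
    run_lengths = (sum(1 for _ in run) for key, run in groupby(s) if key)
    # default=0 covers the case where there are no True runs at all.
    return max(run_lengths, default=0)

s = pd.Series([False, True, False, True, True, True, False, False])
print(longest_true_run_groupby(s))                          # 3
print(longest_true_run_groupby(pd.Series([False, False])))  # 0
```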
G_M