4

With a pandas dataframe called 'df' like follows

             A
2015-05-01  True
2015-05-02  True
2015-05-03  False
2015-05-04  False
2015-05-05  False
2015-05-06  False
2015-05-07  True
2015-05-08  False
2015-05-09  False

I want to return a slice that is the longest consecutive number of rows where column 'A' reads 'False'. Can this be done?

vestland
  • 55,229
  • 37
  • 187
  • 305
Runner Bean
  • 4,895
  • 12
  • 40
  • 60

2 Answers2

5

You can use cumsum to detect changes in A column as boolean in python can be summed.

# Test data
df= DataFrame([True, True, False, False, False, False, True, False, False], 
              index=pd.to_datetime(['2015-05-01', '2015-05-02', '2015-05-03',
                                   '2015-05-04', '2015-05-05', '2015-05-06',
                                   '2015-05-07', '2015-05-08', '2015-05-09']), 
              columns=['A'])

# We have to ensure that the index is sorted
df.sort_index(inplace=True)
# Resetting the index to create a column
df.reset_index(inplace=True)

# Grouping by the cumsum and counting the number of dates and getting their min and max
df = df.groupby(df['A'].cumsum()).agg(
    {'index': ['count', 'min', 'max']})

# Removing useless column level
df.columns = df.columns.droplevel()

print(df)
#    count        min        max
# A                             
# 1      1 2015-05-01 2015-05-01
# 2      5 2015-05-02 2015-05-06
# 3      3 2015-05-07 2015-05-09

# Getting the max
df[df['count']==df['count'].max()]

#    count        min        max
# A                             
# 2      5 2015-05-02 2015-05-06
Romain
  • 19,910
  • 6
  • 56
  • 65
  • great, but i cant use 'date' as its the index and if i try df.index instead i get TypeError: unhashable type: 'DatetimeIndex' – Runner Bean Oct 16 '16 at 08:50
  • My df is actually an object called sData which has a dataframe 'df' as a member. So if i try 'index' i get an error, if i try 'df.index' or 'sData.df.index' I also get an error. I just dont know how what to write wher eyou write 'index' in the agg function – Runner Bean Oct 16 '16 at 08:58
  • @RunnerBean You have to reset the index first `df.reset_index(inplace=True)` as demonstrated in my example. – Romain Oct 16 '16 at 08:59
  • my code is sData.df['A'] = sData.df['S_missing'].cumsum() result = sData.df.groupby('A').agg({'sData.df.index': ['count', 'min', 'max']}) – Runner Bean Oct 16 '16 at 08:59
1

Sorry to bring back an old post, but I noticed the result of Romain's answer is slightly off- the counts are incorrect and it causes an inaccurate result. There should be 4 items in the count column: [2, 4, 1, 2] with the max being 4.

To demonstrate the problem- I've broken it down a bit (df is identical to the accepted answer above). You can see that the resulting groups are incorrect:

# sort
dfS = df.sort_index(inplace=True)
# reset
dfSR = dfS.reset_index(inplace=True)
# group
dfG = dfSR.groupby(df['A'].cumsum())

# show resulting groups
for group in dfG: print(group)

# (1,        index     A
# 0 2015-05-01  True)
# (2,        index      A
# 1 2015-05-02   True
# 2 2015-05-03  False
# 3 2015-05-04  False
# 4 2015-05-05  False
# 5 2015-05-06  False)
# (3,        index      A
# 6 2015-05-07   True
# 7 2015-05-08  False
# 8 2015-05-09  False)

Thanks to the answer by DSM here and of course Romain's answer, combining the techniques of both posts gets the answer. They are already explained in the posts they come from, so I'll leave it at the code below.

import pandas as pd

df = pd.DataFrame([True, True, False, False, False, False, True, False, False], 
              index=pd.to_datetime(['2015-05-01', '2015-05-02', '2015-05-03',
                                   '2015-05-04', '2015-05-05', '2015-05-06',
                                   '2015-05-07', '2015-05-08', '2015-05-09']), 
              columns=['A'])

df.sort_index(inplace=True)
df.reset_index(inplace=True)

dfBool = df['A'] != df['A'].shift()
dfCumsum = dfBool.cumsum()

groups = df.groupby(dfCumsum)

for g in groups: print(g)

groupCounts = groups.agg({'index':['count', 'min', 'max']})
groupCounts.columns = groupCounts.columns.droplevel()

print('\n', groupCounts, '\n')

maxCount = groupCounts[groupCounts['count'] == groupCounts['count'].max()]

print(maxCount, '\n')

Output:

(1,        index     A
0 2015-05-01  True
1 2015-05-02  True)
(2,        index      A
2 2015-05-03  False
3 2015-05-04  False
4 2015-05-05  False
5 2015-05-06  False)
(3,        index     A
6 2015-05-07  True)
(4,        index      A
7 2015-05-08  False
8 2015-05-09  False)

    count        min        max
A                             
1      2 2015-05-01 2015-05-02
2      4 2015-05-03 2015-05-06
3      1 2015-05-07 2015-05-07
4      2 2015-05-08 2015-05-09 

   count        min        max
A                             
2      4 2015-05-03 2015-05-06
HairyP
  • 11
  • 4