
I'm trying to get the maximum count of consecutive 0 values per id from a pandas DataFrame with id, date and value columns, which looks like this:

id    date       value
354   2019-03-01 0
354   2019-03-02 0
354   2019-03-03 0
354   2019-03-04 5
354   2019-03-05 5 
354   2019-03-09 7
354   2019-03-10 0
357   2019-03-01 5
357   2019-03-02 5
357   2019-03-03 8
357   2019-03-04 0
357   2019-03-05 0
357   2019-03-06 7
357   2019-03-07 7
540   2019-03-02 7
540   2019-03-03 8
540   2019-03-04 9
540   2019-03-05 8
540   2019-03-06 7
540   2019-03-07 5
540   2019-03-08 2 
540   2019-03-09 3
540   2019-03-10 2

The desired result is grouped by id and looks like this:

id   max_consecutive_zeros
354  3
357  2
540  0

I've achieved what I want with a for loop, but it gets really slow when you are working with huge pandas DataFrames. I've found some similar solutions, but they didn't work for my problem at all.
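For anyone who wants to reproduce this, the sample frame above can be rebuilt like so (a sketch; parsing `date` as datetime is an assumption):

```python
import io
import pandas as pd

# Rebuild the sample frame from the table above.
csv = """id,date,value
354,2019-03-01,0
354,2019-03-02,0
354,2019-03-03,0
354,2019-03-04,5
354,2019-03-05,5
354,2019-03-09,7
354,2019-03-10,0
357,2019-03-01,5
357,2019-03-02,5
357,2019-03-03,8
357,2019-03-04,0
357,2019-03-05,0
357,2019-03-06,7
357,2019-03-07,7
540,2019-03-02,7
540,2019-03-03,8
540,2019-03-04,9
540,2019-03-05,8
540,2019-03-06,7
540,2019-03-07,5
540,2019-03-08,2
540,2019-03-09,3
540,2019-03-10,2
"""
df = pd.read_csv(io.StringIO(csv), parse_dates=['date'])
```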

Wel

3 Answers


Here is one way: we need to create an additional key for groupby, then we just need to groupby this key together with id.

s = df.groupby('id', group_keys=False).value.apply(lambda x: x.ne(0).cumsum())
df[df.value == 0].groupby([df.id, s]).size().groupby(level=0).max().reindex(df.id.unique(), fill_value=0)
Out[267]: 
id
354    3
357    2
540    0
dtype: int64
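In response to the comment about `lambda` below: the one-liner can be unrolled into named steps. A sketch on a trimmed-down frame (the small `df` here is illustrative only; `group_keys=False` is added so the key keeps the original row index on newer pandas):

```python
import pandas as pd

df = pd.DataFrame({
    'id':    [354, 354, 354, 354, 357, 357, 357, 540, 540],
    'value': [0,   0,   0,   5,   5,   0,   0,   7,   8],
})

# Per id, count how many non-zero values have been seen so far.
# The running count stays flat across a run of zeros, so every row
# in the same zero-run shares the same key.
s = df.groupby('id', group_keys=False).value.apply(lambda x: x.ne(0).cumsum())

# Keep only the zero rows and group them by (id, key): each group is
# one run of consecutive zeros, and its size is the run length.
runs = df[df.value == 0].groupby([df.id, s]).size()

# Largest run per id; reindex adds ids that never hit zero.
out = runs.groupby(level=0).max().reindex(df.id.unique(), fill_value=0)
```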
BENY
  • way too abstract for me (I'm not used to some things like lambda etc.), could you explain a bit about what it does? – Wel Aug 05 '19 at 19:48

Create a group ID m for consecutive rows of the same value. Next, groupby on id and m and call value_counts, then use .loc on the MultiIndex to slice only the 0 value of the right-most index level. Finally, filter out duplicated entries of the id index with duplicated, and reindex to create a 0 value for any id having no 0 count.

m = df.value.diff().ne(0).cumsum().rename('gid')
#Consecutive rows having the same value are assigned the same ID number by this command.
#It is the way to identify a group of consecutive rows having the same value, so I called it a group ID.

df1 = df.groupby(['id', m]).value.value_counts().loc[:, :, 0].droplevel(-1)
#This groupby groups consecutive rows of the same value per id into separate groups.
#Within each group, count the number of each value, and use `.loc` to pick
#specifically only `0`, because we only care about the count of value `0`.

df1[~df1.index.duplicated()].reindex(df.id.unique(), fill_value=0)
#There are several groups of value `0` per `id`. We want only the group with the highest count.
#`value_counts` already sorts the counts descending, so we just need to pick
#the top one of the duplicates by slicing on the True/False mask of `duplicated`.
#Finally, `reindex` adds any `id` that doesn't have value 0 in the original `df`.
#Note: `id` is the column `id` in `df`. It is different from the group ID `m` we created to use with groupby.

Out[315]:
id
354    3
357    2
540    0
Name: value, dtype: int64
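To make `m` concrete (it comes up in the comments), here is what it evaluates to on id 354's value column alone; the standalone series is just for illustration:

```python
import pandas as pd

value = pd.Series([0, 0, 0, 5, 5, 7, 0])

# diff() is NaN on the first row, so ne(0) is True there and the cumulative
# sum starts at 1; after that, the counter ticks up exactly when the value
# changes, so each run of equal values shares one group ID.
m = value.diff().ne(0).cumsum()

print(pd.DataFrame({'value': value, 'gid': m}))
```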
Andy L.
  • I didn't get what m does there, could you explain a bit more? – Wel Aug 06 '19 at 12:57
  • @Wel: Sorry, I hadn't seen your comment until StackOverflow notified me today. Have you figured out the code above? – Andy L. Aug 07 '19 at 17:04
  • not fully, but I've tried it with the above example and it worked the same way as my loop and like 100000000000 times faster, I just haven't figured out what m is yet. – Wel Aug 07 '19 at 17:34
  • @Wel: I am not good at words. However, I tried my best to add some explanations in the code. Please check the update. You may also unchain each command above and run each separately in the console to see its output and get more idea of their purposes. – Andy L. Aug 07 '19 at 18:36
  • just one more question, what is the reason for the droplevel(-1)? What is the motive for using a negative value there? – Wel Aug 08 '19 at 20:42
  • @Wel: just for cosmetic reasons. `groupby` and `value_counts` return a 2-level multiindex series. `droplevel(-1)` drops the last level, keeping only the first level, which is `id`, to show your desired output – Andy L. Aug 08 '19 at 23:37
  • Breaking m = df.value.diff().ne(0).cumsum().rename('gid') down as far as I understand it: 1. rename('gid') renames the resulting series 2. cumsum() is the running total (always add yourself to the total from right before) 3. ne(0) means 'is the object not equal to zero?', the object being the result of the preceding step, so it returns a boolean (True or False) 4. diff() is a pandas method that returns the difference of each number from its left neighbor (e.g. diff([1,2,3,4,6]) = [1,1,1,2]) 5. df.value refers to the Series being used in this calculation – Nesha25 Apr 01 '22 at 00:35
  • So, the goal is that while the numbers are the same, the ne() of the difference between a value and the preceding value will continue to return False (i.e. it IS equal to zero, i.e. value minus previous_value is 0). But as soon as value minus previous value is NOT 0, we get a True. As soon as this happens, the cumsum gets a value other than zero to add to the total, so the GID goes up by 1 and stays there until there is another True value, at which point the GID goes up again. – Nesha25 Apr 01 '22 at 00:36

You could do:

import numpy as np

df.groupby('id').value.apply(lambda x: ((x.diff() != 0).cumsum())
                                       .where(x == 0, np.nan)
                                       .value_counts().max()).fillna(0)

Output

id
354    3.0
357    2.0
540    0.0
Name: value, dtype: float64
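The floats come from `fillna`: `value_counts().max()` is `NaN` for an id with no zeros, which pushes the whole result to float dtype. If integer output matters, a final cast works; a sketch on a made-up mini frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':    [354, 354, 354, 354, 357, 357, 540],
    'value': [0,   0,   0,   5,   0,   0,   7],
})

out = (df.groupby('id').value
         .apply(lambda x: (x.diff() != 0).cumsum()
                           .where(x == 0, np.nan)
                           .value_counts().max())
         .fillna(0)
         .astype(int))   # cast away the float dtype introduced by the NaN
```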
Ayoub ZAROU