
So according to this answer, it is better not to iterate over rows in a Pandas DataFrame. However, I don't know how to solve my problem without using a for-loop.

I need to detect any consecutive repetition (three or more times) in a particular column. So, for example, if the value 0 appears in three consecutive rows for a particular ID, I want to know the ID.

ID     Value
1       0
1       0.5
1       0   <--- I need this ID, because there are three consecutive 0s.
1       0
1       0
1       0.2
2       0.1
2       0   <--- Not this one! It only appears twice in a row for this ID.
2       0
3       0
3       0

Maybe it's worth mentioning that it's a time series, so the order is crucial.
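For anyone who wants to experiment, the sample above can be rebuilt as a DataFrame (column names `ID`/`Value` taken from the table):

```python
import pandas as pd

# the expected result is ID 1: rows 2-4 hold three consecutive 0s
df = pd.DataFrame({'ID':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3],
                   'Value': [0, 0.5, 0, 0, 0, 0.2, 0.1, 0, 0, 0, 0]})
```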

Zarif

4 Answers


You can do something like:

import numpy as np

# f returns the size of each block (all values within a block are equal)
f = lambda x: np.diff(np.r_[0, np.flatnonzero(np.diff(x)) + 1, x.size])[0]
blocks = df[['ID', 'Value']].ne(df[['ID', 'Value']].shift()).cumsum()
df[blocks.groupby(['ID', 'Value'])['Value'].transform(f).ge(3)]

   ID  Value
2   1    0.0
3   1    0.0
4   1    0.0
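If it helps, here is a self-contained version of the same approach (imports included), run on data where the same ID reappears later in the series; grouping by both block labels keeps the interleaved runs apart:

```python
import numpy as np
import pandas as pd

# ID 1 is interrupted by ID 2, so per-ID run tracking matters
df = pd.DataFrame({'ID':    [1, 1, 2, 1, 1, 2, 1, 1, 3, 3, 3, 3],
                   'Value': [0, 0, 0, 0, 0, 0, 0, 0.3, 0, 0, 0, 0]})

# inside a (ID-block, Value-block) group all values are equal,
# so f simply returns the group size
f = lambda x: np.diff(np.r_[0, np.flatnonzero(np.diff(x)) + 1, x.size])[0]

blocks = df[['ID', 'Value']].ne(df[['ID', 'Value']].shift()).cumsum()
mask = blocks.groupby(['ID', 'Value'])['Value'].transform(f).ge(3)
print(df.loc[mask, 'ID'].unique())  # [3]
```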
anky
  • Yes but what if there are multiple Ids, you still print the result in a single dataframe – Kaies LAMIRI Jul 04 '19 at 12:04
  • This will return 3 same consecutives for _any_ `Value`.... not just consecutive `0`'s – Chris Adams Jul 04 '19 at 12:06
  • @ChrisA is it intended just for 0? I thought 0 is an example here based on "I need to detect any consecutive repetition " :) – anky Jul 04 '19 at 12:07
  • @anky_91 yeh, could be. I've probably misinterpreted – Chris Adams Jul 04 '19 at 12:09
  • @anky_91 Thanks for your answer. Unfortunately, I'm getting the following error: 'DataFrameGroupBy' object has no attribute 'Value'. Also, the dataframe is actually a time series so the order is crucial. Is it possible that `.groupby` messes the order? – Zarif Jul 04 '19 at 12:27
  • I feel like this might be incorrect if there are 3 pairs of consecutive equal `Value`. – Quang Hoang Jul 04 '19 at 12:27
  • @QuangHoang edited the answer which I think should work. – anky Jul 04 '19 at 12:41
  • Yeah, looks like it does once you groupby two series. – Quang Hoang Jul 04 '19 at 12:43
  • @anky_91 Thank you for the improved answer, but I'm still getting an error: AttributeError: 'DataFrame' object has no attribute 'Value' – Zarif Jul 04 '19 at 12:44
  • @Zarif replace Value with the original column name. In the example it's Value – anky Jul 04 '19 at 12:47
  • @anky_91 Hmm seems like it still works if it's not consecutive. I think `groupby` messes it up. The order is important, because it's based on time. – Zarif Jul 04 '19 at 12:53
  • @Zarif it should work. Non consecutive gets 1, 2 consecutive gets 2, 3 gets 3 and so on. I tested on additional data and it works fine. Probably you can add a sample showing how it doesnt work. Thanks – anky Jul 04 '19 at 13:26
  • @anky_91 the output of the following dataframe should be only 3, but I get 1 and 3: `data = pd.DataFrame({'id': [1, 1, 2, 1, 1, 2, 1, 1, 3, 3, 3, 3], 'value': [0, 0, 0, 0, 0, 0, 0, 0.3, 0, 0, 0, 0]} )` – Zarif Jul 04 '19 at 13:52
  • @Zarif got it. You have to do the same under the groupby for ID as I did for the Value column: `data[data.groupby([data.id.ne(data.id.shift()).cumsum(),data.value.ne(data.value.shift()).cumsum()])['value'].transform(f).ge(3)]` – anky Jul 04 '19 at 14:02
  • @anky This post was really helpful! How would I have to adjust the code to return any sequence where the Value repeats more than 5 time consecutively (and until the end of the sequence)? I haven't been able to figure this out. Changing ge() did not do the trick for me unfortunately. – BlackPearl Jun 02 '22 at 19:33

The first assumption is that the IDs are sorted.

Steps:

1- Sort the DataFrame.

2- Copy the index into a new column so we can test for consecutiveness.

3- Split the DataFrame into multiple DataFrames keyed by the tuple (id, value).

4- Loop over those DataFrames (which is not resource-intensive).

5- Match the conditions and get the ID.

import pandas

df = pandas.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'value': [0.5, 0, 0, 0, 0.1, 0, 0, 0.3, 0, 0]})

# a stable sort keeps the time order within each id; assign the result back
df = df.sort_values(by=['id'], kind='mergesort').reset_index(drop=True)
df['cons'] = df.index
CONST_VALUE = 0

d = dict(tuple(df.groupby(['id', 'value'])))

def is_consecutive(list_):
    setl = set(list_)
    return len(list_) == len(setl) and setl == set(range(min(list_), max(list_)+1))

for k, v in d.items(): 
    if (k[1]==CONST_VALUE and len(v)>=3 and is_consecutive(v['cons'].to_list())): 
        print('wanted ID : {}'.format(k[0]))

Output:

wanted ID : 1
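The `is_consecutive` helper above just checks that a list of index labels forms a gap-free range; a quick illustration:

```python
def is_consecutive(list_):
    setl = set(list_)
    return len(list_) == len(setl) and setl == set(range(min(list_), max(list_) + 1))

print(is_consecutive([3, 4, 5]))  # True: an unbroken run of index labels
print(is_consecutive([3, 5, 6]))  # False: the gap at 4 means the run was interrupted
```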
Kaies LAMIRI

This is not a trivial problem and requires a double groupby, similar to @anky_91's solution:

# a little different df
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3],
 'Value': [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.2, 0.1, 0.0, 0.0, 0.0]})

# mark, within each ID, where Value differs from the next row
s = df.groupby('ID').Value.transform(lambda x: x.ne(x.shift(-1)))

# group by ID and by block; cumsum turns the marks into block labels,
# and cumcount() == 2 flags a third consecutive equal value
idx = s.groupby([df['ID'], s.cumsum()]).cumcount().eq(2)

Gives:

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
dtype: bool

And you can extract the needed IDs with:

df.loc[idx, 'ID'].unique()
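Put together as a runnable script (nothing assumed beyond `pandas`):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3],
                   'Value': [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.2,
                             0.1, 0.0, 0.0, 0.0]})

# mark, within each ID, where Value differs from the next row
s = df.groupby('ID').Value.transform(lambda x: x.ne(x.shift(-1)))

# cumsum turns those marks into block labels; within (ID, block),
# cumcount() == 2 flags the third consecutive equal value
idx = s.groupby([df['ID'], s.cumsum()]).cumcount().eq(2)

print(df.loc[idx, 'ID'].unique())  # [1]
```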
Quang Hoang

Not the best way but:

>>> def runs_of_three(x):
...     run_id = (x['Value'] != x['Value'].shift()).cumsum().tolist()
...     return [i for i in run_id if run_id.count(i) >= 3]
...
>>> df2 = df.groupby('ID').apply(runs_of_three).reset_index()
>>> df2.loc[df2.astype(str)[0] != '[]', 'ID'].tolist()
[1]
>>> 
U13-Forward