Grouping dataframe based on consecutive occurrence of values

Question

I have a pandas array which has one column which is either true or false (titled 'condition' in the example below). I would like to group the array by consecutive true or false values. I have tried to use pandas.groupby but haven't succeeded using that method, albeit I think that's down to my lack of understanding. An example of the dataframe can be found below:

df = pd.DataFrame(df)
print df
print df
index condition   H  t
0          1  2    1.1
1          1  7    1.5
2          0  1    0.9
3          0  6.5  1.6
4          1  7    1.1
5          1  9    1.8
6          1  22   2.0

Ideally the output of the program would be something along the lines of what can be found below. I was thinking of using some sort of 'grouping' method to make it easier to call each set of results but not sure if this is the best method. Any help would be greatly appreciated.

index condition   H  t group
0          1  2    1.1  1
1          1  7    1.5  1
2          0  1    0.9  2
3          0  6.5  1.6  2
4          1  7    1.1  3 
5          1  9    1.8  3
6          1  22   2.0  3

jezrael · Answer 1 · 2017-12-12T07:59:08.120

Compare with ne (!=) by shifted column and then use cumsum:

df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
print (df)
       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

Detail:

print (df['condition'].ne(df['condition'].shift()))
index
0     True
1    False
2     True
3    False
4     True
5    False
6    False
Name: condition, dtype: bool

Timings:

df = pd.concat([df]*100000).reset_index(drop=True)


In [54]: %timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 12.2 ms per loop

In [55]: %timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1
10 loops, best of 3: 24.5 ms per loop

In [56]: %%timeit
    ...: df['group'] = 1
    ...: df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
    ...: 
10 loops, best of 3: 26.6 ms per loop

@cᴏʟᴅsᴘᴇᴇᴅ - Thank you. – jezrael Dec 12 '17 at 08:21 — jezrael, Dec 12 '17 at 08:21

cs95 · Accepted Answer · 2017-12-12T08:15:02.650

3

Since you're dealing with 0/1s, here's another alternative using diff + cumsum -

df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1    
df

       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

If you don't mind floats, this can be made a little faster.

df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1
df

   index  condition     H    t  group
0      0          1   2.0  1.1    1.0
1      1          1   7.0  1.5    1.0
2      2          0   1.0  0.9    2.0
3      3          0   6.5  1.6    2.0
4      4          1   7.0  1.1    3.0
5      5          1   9.0  1.8    3.0
6      6          1  22.0  2.0    3.0

Here's the version with numpy equivalents -

df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1
df


       condition     H    t  group
index                             
0              1   2.0  1.1      1
1              1   7.0  1.5      1
2              0   1.0  0.9      2
3              0   6.5  1.6      2
4              1   7.0  1.1      3
5              1   9.0  1.8      3
6              1  22.0  2.0      3

On my machine, here are the timings -

df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['group'] = df.condition.diff().abs().cumsum().fillna(0).astype(int) + 1 
10 loops, best of 3: 25.1 ms per loop

%%timeit
df['group'] = df.condition.diff().abs().cumsum() + 1
df.loc[0, 'group'] = 1

10 loops, best of 3: 23.4 ms per loop

%%timeit
df['group'] = 1
df.loc[1:, 'group'] = np.cumsum(np.abs(np.diff(df.condition))) + 1

10 loops, best of 3: 21.4 ms per loop

%timeit df['group'] = df['condition'].ne(df['condition'].shift()).cumsum()
100 loops, best of 3: 15.8 ms per loop

edited Dec 12 '17 at 08:15

answered Dec 12 '17 at 07:49

cs95

379,657
97
704
746

@jezrael Okay sorry, yes I changed it. The numpy version is faster on my computer. – cs95 Dec 12 '17 at 08:15
This worked great for me. I used the first example as I want to keep my array as a pandas data frame. Thanks for the help, I really appreciate it – Rodrigo Prodohl Dec 12 '17 at 08:15
So performance is not important? I am a bit surprised `diff` slowier solution is better as shift... – jezrael Dec 12 '17 at 08:16
@RodrigoProdohl You're welcome. If you find my answer is too slow, take another look at jezraels' solution though our answers will perform almost equal on smaller data. – cs95 Dec 12 '17 at 08:17
@RodrigoProdohl - but it is up you what solution use, sure ;) – jezrael Dec 12 '17 at 08:18
@jezrael From my experience, [not always](https://stackoverflow.com/a/47545491/4909087)! – cs95 Dec 12 '17 at 08:18
@cᴏʟᴅsᴘᴇᴇᴅ - yes, but maybe OP dont want numpy complicated solution, so use slowier and easier one in link. – jezrael Dec 12 '17 at 08:20
Exactly, it depends on OP. – cs95 Dec 12 '17 at 08:21
1

Both solutions worked great and I really appreciate all the responses I get on my questions so thank you both :) – Rodrigo Prodohl Dec 12 '17 at 08:26

Grouping dataframe based on consecutive occurrence of values

2 Answers2

Linked