group the same consecutive values in pandas and store: values, indices, and column slices

Question

I have a dataframe

import pandas as pd
import numpy as np
v1=list(np.random.rand(30))
v2=list(np.random.rand(30))
mydf=pd.DataFrame(data=zip(v1,v2),columns=['var1','var2'])

Then I apply some boolean conditions on one of the variables:

mydf['cond1']=mydf['var1']>0.2
mydf['cond2']=mydf['var1']>0.8


mydf['cond1']=
0 False
1 True
2 True
3 False
4 False
5 True
6 False
....

I would like to group consecutive rows into blocks where 'cond1' (or 'cond2') is True, and for each block store:

the value of the block: True/False
the index of the start and of the end of the block, e.g. 1,2 and 5,5
the two values of var2 at the start index and at the end index
all the values of var1 between the start index and the end index, as an iterable (list or np.array)

This is an example of the returned values:

summary=
'Start' 'End' 'Start_var2' 'End_var2' 'Value' 'var1'
 1        2    0.3217381    0.454543   True    [0.25,0.26]

score 1 · Answer 1 · answered Jan 25 '18 at 15:01

I think you can use this SO answer. In the loop below, `i` is the group number and `g` is the sub-dataframe for that block; the index of `g` can be used to get the var values.

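import numpy as np
import pandas as pd

v1 = list(np.random.rand(30))
v2 = list(np.random.rand(30))
df = pd.DataFrame(data=list(zip(v1, v2)), columns=['var1', 'var2'])

df['cond1'] = df['var1'] > 0.2
df['cond2'] = df['var1'] > 0.8

# A new block starts wherever cond1 differs from the previous row;
# the cumulative sum of those change points labels each block.
for i, g in df.groupby((df['cond1'] != df['cond1'].shift()).cumsum()):
    print(i)                     # block number
    print(g)                     # rows belonging to this block
    print(g['cond1'].tolist())   # the (constant) cond1 values of the block
    print(g['cond1'].index[0])   # start index; can get var values from this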
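
To make the grouping key less opaque, here is a minimal sketch of what the `shift`/`cumsum` expression produces, using the seven `cond1` values printed in the question. The variable names `starts_new_block` and `block_id` are made up for illustration.

```python
import pandas as pd

# The first seven cond1 values shown in the question.
cond1 = pd.Series([False, True, True, False, False, True, False])

# True wherever a value differs from the previous row. Row 0 always counts,
# because shift() puts a missing value in front of it.
starts_new_block = cond1 != cond1.shift()

# Running count of block starts = block number for every row.
block_id = starts_new_block.cumsum()

print(block_id.tolist())  # [1, 2, 2, 3, 3, 4, 5]
```

Rows that share a block number are consecutive and have the same `cond1` value, which is why grouping on this series yields one group per block.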

Antoine Zambelli


That's very close. Any chance of avoiding the for loop? It is a long df. – 00__00__00 Jan 25 '18 at 15:02

I doubt it; `groupby` returns an object that you're gonna have to unpack somehow. I don't consider myself an expert, though. The person who answered the linked question might be able to help. – Antoine Zambelli Jan 25 '18 at 15:07

score 1 · Accepted Answer · answered Jan 25 '18 at 15:21

IIUC, let's try something like this:

mydf.groupby(mydf.cond1.diff().cumsum(), as_index=False)\
    .apply(lambda x: pd.Series([x.iloc[0].name,
                                x.iloc[-1].name, 
                                x.iloc[0]['var2'], 
                                x.iloc[-1]['var2'], 
                                x.iloc[0]['cond1'], 
                                x.var1.tolist()],
                                index=['Start','End','Start_var2',
                                       'End_var2','Value','var1']))

Output:

   Start  End  Start_var2  End_var2  Value                                               var1
0      1   13    0.580713  0.772878   True  [0.9080110836630401, 0.34879731608699105, 0.63...
1     14   14    0.688374  0.688374  False                              [0.11739843719148924]
2     15   15    0.204304  0.204304   True                               [0.3010533582011998]
3     16   17    0.470689  0.808964  False         [0.14526373397045378, 0.09218609736837002]
4     18   20    0.675035  0.087408   True  [0.6029321967069232, 0.3641874497564469, 0.564...
5     21   21    0.346795  0.346795  False                               [0.1913357207205566]
6     22   29    0.944366  0.845753   True  [0.6769058596527606, 0.2155054472756598, 0.278...
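
One thing to watch in the output above: the blocks cover indices 1 to 29, so row 0 is missing. A likely cause is that `diff()` leaves a missing value in the first row, and `groupby` drops rows whose key is missing by default. Below is a minimal sketch of a variant that keeps row 0 by building the key with `shift()` instead. It assumes a pandas version with named aggregation (0.25 or later); the helper name `summarize_blocks`, the temporary `_idx` column, and the small `demo` frame are made up for illustration.

```python
import pandas as pd


def summarize_blocks(df, cond_col):
    """Summarize runs of consecutive equal values in df[cond_col]."""
    # Block number per row; comparing with shift() marks row 0 as a block start.
    block_id = (df[cond_col] != df[cond_col].shift()).cumsum()
    # Expose the index as a column so it can be aggregated like the others.
    frame = df.assign(_idx=df.index)
    summary = frame.groupby(block_id).agg(
        Start=('_idx', 'first'),
        End=('_idx', 'last'),
        Start_var2=('var2', 'first'),
        End_var2=('var2', 'last'),
        Value=(cond_col, 'first'),
        var1=('var1', list),
    )
    return summary.reset_index(drop=True)


# Small deterministic frame whose cond1 matches the question's printout.
demo = pd.DataFrame({
    'var1': [0.1, 0.5, 0.6, 0.1, 0.1, 0.9, 0.1],
    'var2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})
demo['cond1'] = demo['var1'] > 0.2

summary = summarize_blocks(demo, 'cond1')
true_blocks = summary[summary['Value']]  # keep only the blocks where cond1 is True
print(true_blocks)
```

On this frame, `true_blocks` holds the blocks 1 to 2 and 5 to 5, the same start/end pairs given as an example in the question. Note that `'first'` and `'last'` skip missing values, so the sketch assumes `var2` has none.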

Perfect. Nice, compact, and without explicit loops. – 00__00__00 Jan 25 '18 at 18:06
