How to select rows in a DataFrame based on every transition for particular values in a particular column?

Question

I have a DataFrame that has a ID column and Value column that only consist (0,1,2). I want to capture only those rows, if there is a transition from (0-1) or (1-2) in value column. This process has to be done for each ID separately.

I tried to do the groupby for ID and using a difference aggregation function. So that i can take those rows for which difference of values is 1. But it is failing in certain condition.

df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()

Given DataFrame:

Index UniqID Value

1 a 1

2 a 0

3 a 1

4 a 0

5 a 1

6 a 2

7 b 0

8 b 2

9 b 1

10 b 2

11 b 0

12 b 1

13 c 0

14 c 1

15 c 2

16 c 2

Expected Output:

2 a 0

3 a 1

4 a 0

5 a 1

6 a 2

9 b 1

10 b 2

11 b 0

12 b 1

13 c 0

14 c 1

15 c 2

Only expecting those rows when there is a transition from either 0-1 or 1-2.

Thank you in advance.

So we can expect rows are sorted by index which makes transition you cited meaningful? — jlandercy, Sep 05 '19 at 07:28
@jlandercy actually yes, i did sorting of id as well of additional timestamp column which gives me the sequence of events, i modified the data and code to remove the complexity of the problem and just concentrate on asking question on a particular part where i am being stuck. — Shashank Singh Yadav, Sep 05 '19 at 07:37
When you say transition is it in both ways 1-2 and 2-1 or only increasing ? — Mayeul sgc, Sep 05 '19 at 08:10
@ShashankSinghYadav - is correct `8 b 2` ? Because there is no pattern `1-2` — jezrael, Sep 05 '19 at 08:45
@jezrael yes ,it is correct as the values column does not follow any particular patter, its just we have to capture a transition (0-1) or (1-2) if present in each unique id seprately. — Shashank Singh Yadav, Sep 05 '19 at 09:12
@ShashankSinghYadav - Not understand, why is necessary index=8 if no match `1-2`, `0-1` pattern? Rows with no matching are not removed? I think test per groups. — jezrael, Sep 05 '19 at 09:16
@jezrael extremely sorry, my bad. You are right ,there is no need for index 8 , i have edited the expected part. — Shashank Singh Yadav, Sep 05 '19 at 09:27
I don't get why, if there is no pattern 1-0, the lines 3 and 4 are not affected. — Mayeul sgc, Sep 05 '19 at 09:51
@Mayeulsgc instead look index 2,3 having (0-1) and index 4,5 having(0-1) too. — Shashank Singh Yadav, Sep 05 '19 at 09:54

jezrael · Answer 1 · 2019-09-06T08:56:51.923

2

Use this my solution working for groups with tuples of patterns:

np.random.seed(123)

N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]

a = np.array(pat)

s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None :] == a, axis=1).any(), raw=True))

mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]

print (df)
   UniqID  Value
99      a      1
98      a      2
12      a      1
63      a      2
38      a      0
41      a      1
9       a      1
72      a      2
64      b      1
67      b      2
33      b      0
68      b      1
57      b      1
71      b      2
10      b      0
8       b      1
61      c      1
66      c      2
46      c      0
0       c      1
40      c      2
21      d      0
74      d      1
15      d      1
85      d      2
6       d      1
88      d      2
91      d      0
83      d      1
4       d      1
34      d      2
96      d      0
48      d      1
29      d      0
84      d      1
32      e      0
62      e      1
37      e      1
55      e      2
16      e      0
23      e      1

edited Sep 06 '19 at 08:56

answered Sep 05 '19 at 08:43

jezrael

822,522
95
1,334
1,252

you're missing line 8 – Mayeul sgc Sep 05 '19 at 08:44
@jezrael it is failing for large data (when i applied your solution , i got value sequence 1,2,1,1,0,1) ..here 1,1 was not required. – Shashank Singh Yadav Sep 05 '19 at 10:26
@jezrael To be more precise, a part of original data was (0,1,2,1,2) and output received by apply your method was (0,1,2,2)......similarly for (1,0,1,0,1,2) ....i gave (1,1,0,1,2), this is places i found fault while checking. That means it is excluding some rows which it shouldnt. – Shashank Singh Yadav Sep 05 '19 at 12:18
@jezrael yes, they are in same group. – Shashank Singh Yadav Sep 05 '19 at 12:27
@jezrael i mean to say that at some places when the value column had 0,1,0,1 ...the code returned 0,1,1 (skipping one 0), similarly, if somewhere 1,2,1,2 ....the code returned (1,2,2). – Shashank Singh Yadav Sep 06 '19 at 06:52
@jezrael I have noticed that whenever there is a sequence 1,2,1,2......your code removes the 2nd "1"....and gives 1,2,2. So, many place in output i got 2,2. – Shashank Singh Yadav Sep 06 '19 at 08:44
@ShashankSinghYadav - There are some `1,1` values, but only if `0-1` and then `1-2` in groups d and e – jezrael Sep 06 '19 at 09:01
@jezrael its strange, your result look perfect.....for me i am getting 2,2. – Shashank Singh Yadav Sep 06 '19 at 09:05
@ShashankSinghYadav - hmmm, are data confidental? Can I check it? – jezrael Sep 06 '19 at 09:07
@ShashankSinghYadav - one idea - if use `df = df.reset_index(drop=True)` for unique index before my solution it should help. – jezrael Sep 06 '19 at 09:31

Parth · Accepted Answer · 2019-09-11T08:24:34.077

1

Assuming, transition is strictly from 1 -> 2 and 0 -> 1. (This assumption is valid as well.)

Similar Sample data:

index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2

Load this in pandas dataframe. Then, Using below code:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index']))
target_index=df.groupby('id').apply(lambda x:grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id','value']])

It gives desired dataframe based on assumption:

     index id  value
1       2  a      0
2       3  a      1
3       4  a      0
4       5  a      1
5       6  a      2
8       9  b      1
9      10  b      2
10     11  b      0
11     12  b      1
12     13  c      0
13     14  c      1
14     15  c      2

Edit: To include transition 1->0, below is updated function:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    index1=list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index'])
    index2=list(x[(x.dif==-1)&(x.value==0)]['index']-1)+list(x[(x.dif==-1)&(x.value==0)]['index'])
    return pd.DataFrame(index1+index2)

edited Sep 11 '19 at 08:24

answered Sep 05 '19 at 08:40

Parth

644
4
10

when taken a larger data set, if value in one id ends with 0 and another id start with 1, than your solution combine them both and take both the rows even if they are from different id. – Shashank Singh Yadav Sep 06 '19 at 05:28
@ShashankSinghYadav Updated answer. Please check now. – Parth Sep 06 '19 at 06:28
still having problem, for example ...it is taking 0 for id A and 1 from id B. But according to our requirement, the code should remove them as they are not forming a pair. – Shashank Singh Yadav Sep 06 '19 at 07:05
@ShashankSinghYadav If you can give example/sample data for case where you are facing problem, it would be faster to reproduce for solving and scope of misunderstanding could be reduced. (Give all 3 columns for case you are trying to describe here.) – Parth Sep 06 '19 at 07:39
@ShashankSinghYadav Expected output Index -> 7 and 8 belongs to different group 7 has Value 1 and 8 has value 2, but this transition is invalid as they belong to different group. Please check validity of output row index you have shared, does not seem right as per requirement. – Parth Sep 06 '19 at 08:45
@ShashankSinghYadav Even expected output index should include 6th Index, as there is valid transition from 5 to 6 Index. Please check this as well. – Parth Sep 06 '19 at 08:51
Still Index 7 and 8 belongs to different group and transition is invalid. – Parth Sep 06 '19 at 09:12
Extremely sorry..had some typo error...please check (index= 1,2,3,4,5,6,7,8,9,10,11,12,13 id= a,a,b,b,b,b,b,c,c,c, c, d, d value= 1,0,1,0,1,2,1,2,0,1, 0, 1, 0 expected : 4,5,6,9,10) – Shashank Singh Yadav Sep 06 '19 at 09:17
@ShashankSinghYadav Okay, then given code returns perfect Index => (4,5,6,9,10). Why at your end it is not working? It is giving perfect output to me. – Parth Sep 06 '19 at 09:19
after i corrected a little problem with index, your code worked fine. Thanks. – Shashank Singh Yadav Sep 06 '19 at 10:29
Great! Glad to help. – Parth Sep 06 '19 at 10:44
Just out of curosity. If i want to include (1,0) too. How we gonna do it? – Shashank Singh Yadav Sep 10 '19 at 04:48
Updated answer for (1,0). If you want more transition to work with, answer by @jezrael is more generalized. You can go ahead with that in case of more transition. – Parth Sep 11 '19 at 08:26

score 1 · Answer 3 · answered Sep 05 '19 at 08:45

My version is using shift and diff() to delete all lines with diff value equal to 0,2 or -2

df = pandas.DataFrame({'index':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],'UniqId':['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],'Value':[1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
df['diff']=np.NaN
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId']==element]=df.loc[df['UniqId']==element]['Value'].diff()
df['diff']=df['diff'].shift(-1)
df=df.loc[(df['diff']!=-2) & (df['diff']!=2) & (df['diff']!=0)]
print(df)

Actually waiting for updates about the 2-1 and 1-2 relationship

How to select rows in a DataFrame based on every transition for particular values in a particular column?

3 Answers3