2

I have a DataFrame that has a ID column and Value column that only consist (0,1,2). I want to capture only those rows, if there is a transition from (0-1) or (1-2) in value column. This process has to be done for each ID separately.

I tried to do the groupby for ID and using a difference aggregation function. So that i can take those rows for which difference of values is 1. But it is failing in certain condition.

df=df.loc[df['values'].isin([0,1,2])]
df = df.sort_values(by=['Id'])
df.value.diff()

Given DataFrame:

Index   UniqID   Value

1    a    1

2    a    0

3    a    1

4    a    0

5    a    1

6    a    2

7    b    0

8    b    2

9    b    1

10    b    2

11    b    0

12    b    1

13    c    0

14    c    1

15    c    2

16    c    2

Expected Output:

2    a    0

3    a    1

4    a    0

5    a    1

6    a    2

9    b    1

10    b    2

11    b    0

12    b    1

13    c    0

14    c    1

15    c    2

Only expecting those rows when there is a transition from either 0-1 or 1-2.

Thank you in advance.

3 Answers3

2

Use this my solution working for groups with tuples of patterns:

np.random.seed(123)

N = 100
d = {
    'UniqID': np.random.choice(list('abcde'), N),
    'Value': np.random.choice([0,1,2], N),
}
df = pd.DataFrame(d).sort_values('UniqID')
#print (df)

pat = [(0, 1), (1, 2)]

a = np.array(pat)

s = (df.groupby('UniqID')['Value']
       .rolling(2, min_periods=1)
       .apply(lambda x: np.all(x[None :] == a, axis=1).any(), raw=True))

mask = (s.mask(s == 0)
         .groupby(level=0)
         .bfill(limit=1)
         .fillna(0)
         .astype(bool)
         .reset_index(level=0, drop=True))

df = df[mask]

print (df)
   UniqID  Value
99      a      1
98      a      2
12      a      1
63      a      2
38      a      0
41      a      1
9       a      1
72      a      2
64      b      1
67      b      2
33      b      0
68      b      1
57      b      1
71      b      2
10      b      0
8       b      1
61      c      1
66      c      2
46      c      0
0       c      1
40      c      2
21      d      0
74      d      1
15      d      1
85      d      2
6       d      1
88      d      2
91      d      0
83      d      1
4       d      1
34      d      2
96      d      0
48      d      1
29      d      0
84      d      1
32      e      0
62      e      1
37      e      1
55      e      2
16      e      0
23      e      1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Assuming, transition is strictly from 1 -> 2 and 0 -> 1. (This assumption is valid as well.)

Similar Sample data:

index,id,value
1,a,1
2,a,0
3,a,1
4,a,0
5,a,1
6,a,2
7,b,0
8,b,2
9,b,1
10,b,2
11,b,0
12,b,1
13,c,0
14,c,1
15,c,2
16,c,2

Load this in pandas dataframe. Then, Using below code:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    return pd.DataFrame(list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index']))
target_index=df.groupby('id').apply(lambda x:grp_trns(x)).values.squeeze()
print(df[df['index'].isin(target_index)][['index', 'id','value']])

It gives desired dataframe based on assumption:

     index id  value
1       2  a      0
2       3  a      1
3       4  a      0
4       5  a      1
5       6  a      2
8       9  b      1
9      10  b      2
10     11  b      0
11     12  b      1
12     13  c      0
13     14  c      1
14     15  c      2

Edit: To include transition 1->0, below is updated function:

def grp_trns(x):
    x['dif']=x.value.diff().fillna(0)
    index1=list(x[x.dif==1]['index']-1)+list(x[x.dif==1]['index'])
    index2=list(x[(x.dif==-1)&(x.value==0)]['index']-1)+list(x[(x.dif==-1)&(x.value==0)]['index'])
    return pd.DataFrame(index1+index2)
Parth
  • 644
  • 4
  • 10
  • when taken a larger data set, if value in one id ends with 0 and another id start with 1, than your solution combine them both and take both the rows even if they are from different id. – Shashank Singh Yadav Sep 06 '19 at 05:28
  • @ShashankSinghYadav Updated answer. Please check now. – Parth Sep 06 '19 at 06:28
  • still having problem, for example ...it is taking 0 for id A and 1 from id B. But according to our requirement, the code should remove them as they are not forming a pair. – Shashank Singh Yadav Sep 06 '19 at 07:05
  • @ShashankSinghYadav If you can give example/sample data for case where you are facing problem, it would be faster to reproduce for solving and scope of misunderstanding could be reduced. (Give all 3 columns for case you are trying to describe here.) – Parth Sep 06 '19 at 07:39
  • @ShashankSinghYadav Expected output Index -> 7 and 8 belongs to different group 7 has Value 1 and 8 has value 2, but this transition is invalid as they belong to different group. Please check validity of output row index you have shared, does not seem right as per requirement. – Parth Sep 06 '19 at 08:45
  • @ShashankSinghYadav Even expected output index should include 6th Index, as there is valid transition from 5 to 6 Index. Please check this as well. – Parth Sep 06 '19 at 08:51
  • Still Index 7 and 8 belongs to different group and transition is invalid. – Parth Sep 06 '19 at 09:12
  • Extremely sorry..had some typo error...please check (index= 1,2,3,4,5,6,7,8,9,10,11,12,13 id= a,a,b,b,b,b,b,c,c,c, c, d, d value= 1,0,1,0,1,2,1,2,0,1, 0, 1, 0 expected : 4,5,6,9,10) – Shashank Singh Yadav Sep 06 '19 at 09:17
  • @ShashankSinghYadav Okay, then given code returns perfect Index => (4,5,6,9,10). Why at your end it is not working? It is giving perfect output to me. – Parth Sep 06 '19 at 09:19
  • after i corrected a little problem with index, your code worked fine. Thanks. – Shashank Singh Yadav Sep 06 '19 at 10:29
  • Great! Glad to help. – Parth Sep 06 '19 at 10:44
  • Just out of curosity. If i want to include (1,0) too. How we gonna do it? – Shashank Singh Yadav Sep 10 '19 at 04:48
  • Updated answer for (1,0). If you want more transition to work with, answer by @jezrael is more generalized. You can go ahead with that in case of more transition. – Parth Sep 11 '19 at 08:26
1

My version is using shift and diff() to delete all lines with diff value equal to 0,2 or -2

df = pandas.DataFrame({'index':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],'UniqId':['a','a','a','a','a','a','b','b','b','b','b','b','c','c','c','c'],'Value':[1,0,1,0,1,2,0,2,1,2,0,1,0,1,2,2]})
df['diff']=np.NaN
for element in df['UniqId'].unique():
    df['diff'].loc[df['UniqId']==element]=df.loc[df['UniqId']==element]['Value'].diff()
df['diff']=df['diff'].shift(-1)
df=df.loc[(df['diff']!=-2) & (df['diff']!=2) & (df['diff']!=0)]
print(df)

Actually waiting for updates about the 2-1 and 1-2 relationship

Mayeul sgc
  • 1,964
  • 3
  • 20
  • 35