0

I would like to reenumerate rows in given df using some conditions. My question is an extension of this question.

Example of df:

   ind  seq  status 
0   1   2    up
1   1   3    mid
2   1   5    down
3   2   1    up
4   2   2    mid
5   2   3    down
6   3   1    up
7   3   2    mid
8   3   3    oth 

The df contains ind column which represents a group. The seq column might have some bad data. That's way I would like to add another column seq_corr to correct the seq enumerating based on some conditions:

  • the first value in a group in status column equals up
  • the last value in a group in status column equals down OR oth
  • in all other cases copy number from seq column.

I know the logical way to do this but I have some troubles how to convert it to Python. Especially when it comes to proper slicing and accessing the first and the last element of each group.

Below you can find my not working code:

 def new_id(x):
    if (x.loc['status',0] == 'up') and ((x.loc['status',-1]=='down') or (x['status',-1]=='oth')):
        x['ind_corr'] = np.arange(1, len(x) + 1)
    else:
        x['seq_corr']= x['seq']
    return x

 df.groupby('ind', as_index=False).apply(new_id)

Expected result:

   ind  seq  status  seq_corr
0   1   2    up       1
1   1   3    mid      2
2   1   5    down     3
3   2   1    up       1
4   2   2    mid      2
5   2   3    down     3
6   3   5    up       1
7   3   2    mid      2
8   3   7    oth      3

Hoping that someone would be able to point me out any solution.

cs95
  • 379,657
  • 97
  • 704
  • 746
Michal
  • 1,927
  • 5
  • 21
  • 27

3 Answers3

2

Let's try df.groupby followed by an apply and concatenation.

vals = df.groupby('ind').apply(
       lambda g: np.where(g['status'].iloc[0] == 'up' 
                       or g['status'].iloc[-1] in {'down', 'oth'},
      np.arange(1, len(g) + 1), g['seq'])
).values

df['seq_corr'] = np.concatenate(vals)

df
   ind  seq status  seq_corr
0    1    2     up         1
1    1    3    mid         2
2    1    5   down         3
3    2    1     up         1
4    2    2    mid         2
5    2    3   down         3
6    3    1     up         1
7    3    2    mid         2
8    3    3    oth         3
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
cs95
  • 379,657
  • 97
  • 704
  • 746
1

Another method with groupby cumcount. To select the first and last row we can use head and tail method and union their indices. I think this one might help with your second question

df['seq_corr'] = df.groupby('ind').cumcount()+1
idx = df.groupby('ind').head(1).index.union(df.groupby('ind').tail(1).index)

df.loc[idx,'seq_corr'] = np.where(~df.loc[idx,'status'].isin(['up','down','oth']),
                                    df.loc[idx,'seq'],df.loc[idx,'seq_corr'])

Sample Output:

   ind  seq status  seq_corr
0    1    2     up         1
1    1    3    mid         2
2    1    5  dance         5
3    2    1     up         1
4    2    2    mid         2
5    2    3   down         3
6    3    1     up         1
7    3    2    mid         2
8    3    3    oth         3
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • Thank you @Bharath shetty. Does `isin([...])` takes into account all conditions with `AND` statement ? My idea was to also have `OR`. But anyway I have merged @cᴏʟᴅsᴘᴇᴇᴅ answer with my idea and it looks like everything works fine. I will post the solution below. – Michal Sep 16 '17 at 14:16
  • This will work fine based on your `OR` condition, since we are taking first and last row of grouped data. This was the desired output right? Any problem you faced? – Bharath M Shetty Sep 16 '17 at 19:09
0

Thanks to @cᴏʟᴅsᴘᴇᴇᴅ I have corrected my code. Looking at first tests, everything is ok.

 def new_id(x):
    if (x['status'].iloc[0] == 'up') and ((x['status'].iloc[-1]=='down') or (x['status'].iloc[-1]=='oth')):
    x['seq_corr'] = np.arange(1, len(x) + 1)
    else:
    x['seq_corr']= x['seq']
    return x
Michal
  • 1,927
  • 5
  • 21
  • 27