0

I have data dataframe where in which I want to add column which will indicate the order of the of the date for each user_id separately, so it will be like in TABLE2.

TABLE1:

user_id         received_at          action    
0043e1a6-52e4   2018-01-05 12:32:10  A                 
                2018-01-05 12:33:13  A                 
                2018-01-05 12:42:12  B                 
0070f782-29f4   2018-01-06 01:41:18  A                 
                2018-01-06 01:42:12  A                 
                2018-01-06 01:43:11  B                 
                2018-01-06 01:44:18  C                               
008aa58a-84a5   2018-01-06 14:22:13  A                 
                2018-01-06 14:23:18  A                 
                2018-01-06 14:24:13  A                 
                2018-01-06 14:25:18  C 

to something like this

TABLE2:

user_id         received_at          action  order_n 
0043e1a6-52e4   2018-01-05 12:32:10  A       1               
                2018-01-05 12:33:13  A       2               
                2018-01-05 12:42:12  B       3               
0070f782-29f4   2018-01-06 01:41:18  A       1               
                2018-01-06 01:42:12  A       2               
                2018-01-06 01:43:11  B       3               
                2018-01-06 01:44:18  C       4                             
008aa58a-84a5   2018-01-06 14:22:13  A       1               
                2018-01-06 14:23:18  A       2               
                2018-01-06 14:24:13  A       3               
                2018-01-06 14:25:18  C       4  

Are there are more efficient and maybe simpler solution than doing iteration for each user_id separately?

Somew
  • 3
  • 4

1 Answers1

0

groupby user_id and get the rank using received_at

df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank)

This doesn't require a sorting step & will assign the correct rank even if the data frame is not sorted by received_at within each group

if the column user_id is set as an index (as your sample data seems to indicate), you could alternative use the following instead. Although, in recent versions of pandas, grouping by named indexes also works (i.e. the above might work)

df.groupby(level=0).received_at.apply(pd.Series.rank)
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
  • Thanks, your first solution is working for this example. However, in bigger sample it provides float 'count_n' if there are two different 'action' in one 'received_at'. Adding flag solves this problem: df['count_n'] = df.groupby('user_id').received_at.apply(pd.Series.rank, **method='first'** ) – Somew Sep 05 '18 at 04:48
  • @Somew, glad you were able to sort out the issue – Haleemur Ali Sep 05 '18 at 05:27