1

I have a dataframe which have 4 column User_id,Transaction_id,product and datetime. For each user i have to select his top n recent transaction,Let assume n=2, My dataframe is like:

    transaction_id  user_id  product  date
         T1             U1     P1     2019-03-27
         T1             U1     P2     2019-03-27
         T1             U1     P3     2019-03-27
         T2             U1     P2     2019-03-21
         T2             U1     P3     2019-03-21
         T3             U1     P2     2019-03-20

I tried to do it by taking help of this group by pandas dataframe and select latest in each group

What i am expecting as a output is:

   transaction_id   user_id  product  date
        T1            U1       P1     2019-03-27
        T1            U1       P2     2019-03-27
        T1            U1       P3     2019-03-27
        T2            U1       P2     2019-03-21
        T2            U1       P3     2019-03-21
sourav khanna
  • 191
  • 2
  • 15
  • Where is the question? What did you try so far? – Guybrush Mar 27 '19 at 09:48
  • 1
    df.sort_values('date',ascending = False).groupby(['order_id','user_id']).head(2) This is what i have tried but the problem with this code is it will give me top n transaction within one transaction – sourav khanna Mar 27 '19 at 09:54

1 Answers1

1

Idea is remove duplicates first by DataFrame.drop_duplicates, get top2 values per groups and DataFrame.merge original DataFrame:

df = (df.merge(df.drop_duplicates(['user_id','date'])
                 .sort_values('date',ascending = False)
                 .groupby('user_id')
                 .head(2)[['user_id','date']])
       )
print (df)
  transaction_id user_id product       date
0             T1      U1      P1 2019-03-27
1             T1      U1      P2 2019-03-27
2             T1      U1      P3 2019-03-27
3             T2      U1      P2 2019-03-21
4             T2      U1      P3 2019-03-21
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252