0

I have a table as below :

User_ID Cricket Football    Chess   Video_ID    Category    Time
1       200     150         100     111          A          Morning
1       200     150         100     222          B          Morning
1       200     150         100     111          A          Afternoon
1       200     150         100     333          A          Morning
2       100     160         80      444          C          Evening
2       100     160         80      222          C          Evening
2       100     160         80      333          A          Morning
2       100     160         80      333          A          Morning

Above table is a transactional table, each entry represents the transaction of a user watching a video.

For Eg. “User_ID” - 1 has watched video’s 4 times.

What all video’s watched are given in “Video_ID” : 111,222,111,333

NOTE : Video_ID - 111 was watched twice by this user.

Cricket, Football, Chess : The values are duplicate for each row. (I.e) No of times “User_ID” 1 played cricket , football, chess are 200,150,100. ( They are duplicate in other rows for that particular “User_ID”.

Category : Which Category that particular Video_ID belongs to. Time : What time the Video_ID was watched.

I am trying to get the below information from the table :

User_ID Top_1_Game  Top_2_Game  Top_1_Cat   Top_2_Cat   Top_Time
1       Cricket     Football    A    B  Morning
2       Football    Cricket     C    A  Evening

NOTE : If the count of Category is same then any one can be kept as Top_1_Category.

Its bit complex though, can anyone help on this ?

pyds_learner
  • 509
  • 4
  • 16

1 Answers1

1

First get top values per groups by User_ID and Video_ID with Series.value_counts and index[0]:

df1 = df.groupby(['User_ID','Video_ID']).agg(lambda x: x.value_counts().index[0])

Then get second top Category by GroupBy.nth:

s = df1.groupby(level=0)['Category'].nth(1)

Remove duplicates by User_ID with DataFrame.drop_duplicates:

df1 = df1.reset_index().drop_duplicates('User_ID').drop('Video_ID', axis=1)
cols = ['User_ID','Category','Time']
cols1 = df1.columns.difference(cols)

Get top2 games by this solution:

df2 = pd.DataFrame((cols1[np.argsort(-df1[cols1].values, axis=1)[:,:2]]),
                    columns=['Top_1_Game','Top_2_Game'],
                    index=df1['User_ID'])

Filter Category and Time with rename columns names:

df3 = (df1[cols].set_index('User_ID')
               .rename(columns={'Category':'Top_1_Cat','Time':'Top_Time'}))

Join together by DataFrame.join and DataFrame.insert Top_2_Cat values:

df = df2.join(df3).reset_index()
df.insert(4, 'Top_2_Cat', s.values)
print (df)
   User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
0        1    Cricket   Football         A         B  Morning
1        2   Football    Cricket         C         A  Evening
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thanks for the answer, when i m trying to follow ur code the first line gives me the error : IndexError: index 0 is out of bounds for axis 0 with size 0 – pyds_learner Apr 03 '19 at 08:36
  • when tested with my sample data of 50 rows i am getting "Top time" ,"Top_1_Game" , "Top_2_Game" correct but the Top_1_Cat and Top_2_Cat are getting it wrong... For Categories, we have group by User id and get value counts and which where is top 2 display them according. But its not working. – pyds_learner Apr 03 '19 at 09:07