Efficient Way of Updating Dataframe Columns

Question

I have two dataframes: let's call them group_user_log and group_user

group_user_log

user_id  server_time  session_id  

1           2018-01-01   435
1           2018-01-01   435
1           2018-01-04   675
1           2018-01-05   454
1           2018-01-05   454
1           2018-01-06   920 


group_train 

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      0            0
1         2018-01-05      0            0

Logic is to pull total and distinct count of session_id from group_user_log where server_time is less than impression_time and populate the total and distinct count columns. Expected output for group_train is:

user_id  impression_time  totalcount  distinct_count
1         2018-01-03      2               1
1         2018-01-05      3               2

I tried doing it row-by-row but that is time consuming and very inefficient for larger dataframes because above data is a subset for a particular user_id from two large dataframes and such calculation needs to be done for a large number of user_id so I am looking to make it efficient.

Thanks for your help!!

Possible duplicate of [Pandas aggregate count distinct](https://stackoverflow.com/questions/18554920/pandas-aggregate-count-distinct) — G. Anderson, Aug 29 '19 at 16:27

anky · Accepted Answer · 2019-08-30T06:42:05.873

With groupby , merge and query:

#merge on user_id and query for server_time<impression_time
m=group_user_log.merge(group_train,on='user_id').query('server_time<impression_time')
#groupby on user_id and impression_time and agg on size and nunique
(m.groupby(['user_id','impression_time'])['session_id'].agg(['size','nunique'])
   .rename(columns={'size':'totalcount','nunique':'distinct_count'}))

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2

You can then use this to update the group_train by setting user_id and impression_time as index:

group_train=group_train.set_index(['user_id','impression_time'])
group_train.update(m)
print(group_train) #.reset_index()

                         totalcount  distinct_count
user_id impression_time                            
1       2018-01-03                2               1
        2018-01-05                3               2

Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198705/discussion-between-chandanjha-and-anky-91). — ChandanJha, Aug 30 '19 at 07:54

Efficient Way of Updating Dataframe Columns

1 Answers1