I am doing query analysis of search engine. User may search different query one by one on google search engine at different time in one session.
I have data with several field: session_id
, log_time
, query
, feature_i
, etc. I want to group by session_id
and then concat
several rows into one by the order of log_time
. So that output data will represent user's behaviors in a time series way.
dataset
Code:
toy_data = pd.DataFrame({'session_id':[1,2,1,2,3,3,],
'log_time':[4,5,6,1,2,3],
'query':['hi','dude','pandas','groupby','sort','agg'],
'cate_feat_0':['apple','banana']*3,
'num_feat_0':[1,2,3,4,5,6]})
print(toy_data)
Output:
session_id log_time query cate_feat_0 num_feat_0
0 1 4 hi apple 1
1 2 5 dude banana 2
2 1 6 pandas apple 3
3 2 1 groupby banana 4
4 3 2 sort apple 5
5 3 3 agg banana 6
What I want:
## note that all list are sorted by log time with each session_id group
session_id query_list log_time_list cate_feat_0_list num_feat_0_list
1 [hi, pandas] [4,6] [apple, apple] [1,3]
2 [groupby, dude] [1,5] [banana, banana] [4,2]
3 [sort,agg] [2,3] [apple, banana] [5,6]
My attempt
First we groupby and agg with code:
toy_data_res = toy_data.groupby('session_id').agg({'query':list, 'log_time':list, 'cate_feat_0':list, 'num_feat_0':list})
toy_data_res
Gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [dude, groupby] [5, 1] [banana, banana] [2, 4]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
Then we sort with in each session with code:
for i in toy_data_res.index:
sort_index = np.argsort(toy_data_res.loc[i,'log_time']) ## get time order with in group
for col in toy_data_res.columns.values:
toy_data_res.loc[i,col] = [toy_data_res.loc[i,col][j] for j in sort_index] ## sort values in cols
toy_data_res
Gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [groupby, dude] [1, 5] [banana, banana] [4, 2]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
My approach is quick slow. Is there any better way to do groupby -> sort with in group -> aggregation
?
Tips:
We can use STRING_AGG
or GROUP_CONCAT
in SQL to do within group sorting.