I have sample pandas dataframe as
sample_df = pd.DataFrame({
'visitor_id' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'pagename' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'order_status' : ['accept', 'accept', 'accept', 'reject', 'reject', 'reject', 'reject', 'reject', 'reject'],
})
visitor_id pagename order_status
MM1 S1 accept
MM1 S1 accept
MM1 S3 accept
MM2 S3 reject
MM2 S4 reject
MM2 S4 reject
MM4 S2 reject
MM4 S2 reject
MM4 S2 reject
I am trying to aggregate journeys based on pagenames for each visitor_id. I have this code to build the journeys -
sample_journeys = sample_df.groupby(['visitor_id']) \
.apply(lambda x: ','.join(x.pagename)) \
.reset_index()
sample_journeys.columns = ['visitor_id', 'page_paths']
It gives the output as
visitor_id page_paths
0 MM1 S1,S1,S3
1 MM2 S3,S4,S4
2 MM4 S2,S2,S2
I also want to take a max
of order_status
column for every visitor_id in my aggregation operation. So the output should have 3 columns -
visitor_id page_paths max_order_status
MM1 S1,S1,S3 accept
MM2 S3,S4,S4 reject
MM4 S2,S2,S2 reject
How do I perform both the aggregations in 1 go in Pandas ?
I tried something like
sample_journeys = sample_df.groupby(['visitor_id']) \
.apply(lambda x: ','.join(x.pagename)) \
.agg({'order_status': ['max']}) \
.reset_index()
but it fails with some error - unexpected character after line continuation character
I also want to rename my final dataframe with column names as [visitor_id, page_paths, max_order_status]
after aggregation, which I find it difficult.