0

I have sample pandas dataframe as

sample_df = pd.DataFrame({
'visitor_id' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'pagename' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'order_status' : ['accept', 'accept', 'accept', 'reject', 'reject', 'reject', 'reject', 'reject', 'reject'],
})

visitor_id  pagename    order_status
MM1       S1    accept
MM1       S1    accept
MM1       S3    accept
MM2       S3    reject
MM2       S4    reject
MM2       S4    reject
MM4       S2    reject
MM4       S2    reject
MM4       S2    reject

I am trying to aggregate journeys based on pagenames for each visitor_id. I have this code to build the journeys -

sample_journeys = sample_df.groupby(['visitor_id']) \
.apply(lambda x: ','.join(x.pagename)) \
.reset_index() 
sample_journeys.columns = ['visitor_id', 'page_paths']

It gives the output as

    visitor_id  page_paths
0   MM1       S1,S1,S3
1   MM2       S3,S4,S4
2   MM4       S2,S2,S2

I also want to take a max of order_status column for every visitor_id in my aggregation operation. So the output should have 3 columns -

visitor_id  page_paths  max_order_status
MM1       S1,S1,S3      accept
MM2       S3,S4,S4      reject
MM4       S2,S2,S2      reject

How do I perform both the aggregations in 1 go in Pandas ?

I tried something like

sample_journeys = sample_df.groupby(['visitor_id']) \
.apply(lambda x: ','.join(x.pagename)) \
.agg({'order_status': ['max']}) \ 
.reset_index() 

but it fails with some error - unexpected character after line continuation character

I also want to rename my final dataframe with column names as [visitor_id, page_paths, max_order_status] after aggregation, which I find it difficult.

Regressor
  • 1,843
  • 4
  • 27
  • 67

1 Answers1

1

Use:

new_df=( sample_df.groupby('visitor_id',sort=False)
                  .agg(page_paths=('pagename',lambda x: ','.join(x)),
                       max_order_status=('order_status','max')) 
                  .reset_index() )
print(new_df)

Output

  visitor_id page_paths max_order_status
0        MM1   S1,S1,S3           accept
1        MM2   S3,S4,S4           reject
2        MM4   S2,S2,S2           reject
ansev
  • 30,322
  • 5
  • 17
  • 31
  • When i use this, the column names appear in 2 rows. Then when i change ur code to `sample_df.groupby('visitor_id').agg(page_paths=('pagename',lambda x: ','.join(x)),max_order_status=('order_status','max')).reset_index()` it works with all the column names in 1 row – Regressor Nov 06 '19 at 22:19
  • Is this your expected departure. Can you show what you get with my code? :) – ansev Nov 06 '19 at 22:22
  • never mind, i did not see that ur code has `reset_index()`. Thanks for the answer, it works like a charm. What does reset_index() do ? – Regressor Nov 06 '19 at 22:25
  • 1
    I added `reset_index` later. `reset_index` resets the index. The current index becomes a column And the new index is 0,1,2 ... You can see it here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html – ansev Nov 06 '19 at 22:27