pandas groupby and aggregate two columns for respective totals, then calculate ratio - summarize the summary

Question

I have following data

user_id   session_id    youtube_id 
1           1             2342 
1           1             3523
1           2             3325
2           1             3423
2           1             2352
2           1             3333 
2           2             2351
2           2             9876
2           3             2388

Goal is to group by user_id and calculate both total_sessions, total_views per user and hence average views per session.

user_id, total_sessions, total_views, average_view_per_session
1,         2,            3,           1.5
2,         3,            6,           2    

 result_df['avg'] = df.groupby('user_id').agg({
     'session_id':lambda x : x.nunique(),
     'youtube_id': 'count'}).apply(lambda x : x['total_views']/x['total_sessions']

Two problems with above:

the resulting columns are still named session_id and youtube_id though they are aggregations
how to carry out the division to get the average_view_per_session?

The above approach gives a key error which could be due to using the original column name for aggregated columns.

Renaming the columns from an aggregate [Naming returned columns in Pandas aggregate function](https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function) — smci, May 08 '18 at 23:13
So you want to aggregate to get the respective totals, then get their ratio to get the average, i.e. 'summarize' your summary. I tried to reword to make this clearer. — smci, May 08 '18 at 23:17

Anton vBR · Answer 1 · 2018-05-08T23:09:13.320

1

As they depend on each other we could add it on a separate line of code after applying the .agg().

How about this (full example)

import pandas as pd

data = '''\
user_id   session_id    youtube_id 
1           1             2342 
1           1             3523
1           2             3325
2           1             3423
2           1             2352
2           1             3333 
2           2             2351
2           2             9876
2           3             2388'''

df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')

df = df.groupby('user_id').agg({'session_id': 'nunique', 'youtube_id': 'count'})
df.columns = ['total sessions','total views']

df['average view per session'] = df['total views']/df['total sessions']

print(df)

Returns:

         total sessions  total views  average view per session
user_id                                                       
1                     2            3                       1.5
2                     3            6                       2.0

edited May 08 '18 at 23:09

answered May 08 '18 at 23:00

Anton vBR

18,287
5
40
46

Missing `from io import StringIO` – smci May 08 '18 at 23:06
1

@smci Thanks, but actually not necessary. Been using `pd.compat.StringIO()` for demo-purposes. It should work for both py2 and py3. – Anton vBR May 08 '18 at 23:07
1

Also don't you mean "a new column" instead of "row"? – smci May 08 '18 at 23:08
@smci I meant a new row (as in a new line of code). It is indeed far-fetched. Edited, thanks. – Anton vBR May 08 '18 at 23:08
Time to sort out which import: [Should we use pandas.compat.StringIO or Python 2/3 StringIO?](https://stackoverflow.com/questions/50283292/should-we-use-pandas-compat-stringio-or-python-2-3-stringio) – smci May 11 '18 at 00:29

pandas groupby and aggregate two columns for respective totals, then calculate ratio - summarize the summary

1 Answers1