1

I have following data

user_id   session_id    youtube_id 
1           1             2342 
1           1             3523
1           2             3325
2           1             3423
2           1             2352
2           1             3333 
2           2             2351
2           2             9876
2           3             2388

Goal is to group by user_id and calculate both total_sessions, total_views per user and hence average views per session.

user_id, total_sessions, total_views, average_view_per_session
1,         2,            3,           1.5
2,         3,            6,           2    

 result_df['avg'] = df.groupby('user_id').agg({
     'session_id':lambda x : x.nunique(),
     'youtube_id': 'count'}).apply(lambda x : x['total_views']/x['total_sessions']

Two problems with above:

  1. the resulting columns are still named session_id and youtube_id though they are aggregations
  2. how to carry out the division to get the average_view_per_session?

The above approach gives a key error which could be due to using the original column name for aggregated columns.

smci
  • 32,567
  • 20
  • 113
  • 146
add-semi-colons
  • 18,094
  • 55
  • 145
  • 232
  • 1
    Renaming the columns from an aggregate [Naming returned columns in Pandas aggregate function](https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function) – smci May 08 '18 at 23:13
  • 1
    So you want to aggregate to get the respective totals, then get their ratio to get the average, i.e. 'summarize' your summary. I tried to reword to make this clearer. – smci May 08 '18 at 23:17

1 Answers1

1

As they depend on each other we could add it on a separate line of code after applying the .agg().

How about this (full example)

import pandas as pd

data = '''\
user_id   session_id    youtube_id 
1           1             2342 
1           1             3523
1           2             3325
2           1             3423
2           1             2352
2           1             3333 
2           2             2351
2           2             9876
2           3             2388'''

df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')

df = df.groupby('user_id').agg({'session_id': 'nunique', 'youtube_id': 'count'})
df.columns = ['total sessions','total views']

df['average view per session'] = df['total views']/df['total sessions']

print(df)

Returns:

         total sessions  total views  average view per session
user_id                                                       
1                     2            3                       1.5
2                     3            6                       2.0
Anton vBR
  • 18,287
  • 5
  • 40
  • 46
  • Missing `from io import StringIO` – smci May 08 '18 at 23:06
  • 1
    @smci Thanks, but actually not necessary. Been using `pd.compat.StringIO()` for demo-purposes. It should work for both py2 and py3. – Anton vBR May 08 '18 at 23:07
  • 1
    Also don't you mean "a new column" instead of "row"? – smci May 08 '18 at 23:08
  • @smci I meant a new row (as in a new line of code). It is indeed far-fetched. Edited, thanks. – Anton vBR May 08 '18 at 23:08
  • Time to sort out which import: [Should we use pandas.compat.StringIO or Python 2/3 StringIO?](https://stackoverflow.com/questions/50283292/should-we-use-pandas-compat-stringio-or-python-2-3-stringio) – smci May 11 '18 at 00:29