0

I have the following challenge. I have the following dataframe called: defined_conversions

user_id    pageviews    conversion    timestamp
1          3            True          08:01:12
1          4            False         07:02:14
1          7            False         08:02:14
2          2            True          10:12:15
2          2            False         05:12:18

What I want to achieve is add an additional column called sum_pageviews that takes the cumulative sum of the pageviews per user.

I built this function to achieve this:

def pageviews_per_user(defined_conversions):
    defined_conversions['sum_pageviews'] = defined_conversions.groupby(['user_id'])['pageviews'].cumsum
    return defined_conversions

What I am worried about is that dataframe will look like this:

   user_id    pageviews    conversion    timestamp    sum_pageviews
    1          3            True          08:01:12    14
    1          4            False         07:02:14    14
    1          7            False         08:02:14    14
    2          2            True          10:12:15    4
    2          2            False         05:12:18    4

I want it to look like:

  user_id    pageviews    conversion    timestamp    sum_pageviews
    1          3            True          08:01:12    3
    1          4            False         07:02:14    7
    1          7            False         08:02:14    14
    2          2            True          10:12:15    2
    2          2            False         05:12:18    4

So in essence the pageviews should sum cumulative following the timestamp. Should I sort my data on timestamp first before running the cumsum formula? Or should I do something else?

ps: I am a beginner with python/pandas

Thanks in advance!

julien1337
  • 37
  • 4

1 Answers1

3

You're close--you just need to call cumsum():

>>> df.sort_values([by, 'timestamp']).groupby('user_id')['pageviews'].cumsum()
0     3
1     7
2    14
3     2
4     4
Name: pageviews, dtype: int64

As a function:

def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
    df.sort_values([by, 'timestamp'], inplace=True)
    df['sum_pageviews'] = df.groupby(by=by, sort=False, **kwargs)[aggcol].cumsum()
    return df

Note that this will not just return the DataFrame but modify it in-place.


Here's how you would use the function:

>>> df
   user_id  pageviews  conversion timestamp
0        1          3        True  08:01:12
1        1          4       False  07:02:14
2        1          7       False  08:02:14
3        2          2        True  10:12:15
4        2          2       False  05:12:18
>>> def pageviews_per_user(df, by='user_id', aggcol='pageviews', **kwargs):
...     df.sort_values([by, 'timestamp'], inplace=True)
...     df['sum_pageviews'] = df.groupby(by=by, **kwargs)[aggcol].cumsum()
...     return df
... 
>>> pageviews_per_user(df)
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4
>>> df
   user_id  pageviews  conversion timestamp  sum_pageviews
1        1          4       False  07:02:14              4
0        1          3        True  08:01:12              7
2        1          7       False  08:02:14             14
4        2          2       False  05:12:18              2
3        2          2        True  10:12:15              4

Although timestamp is not a column of datetimes (just strings, as far as Pandas is concerned), it is still able to be sorted lexicographically.

The use of by, aggcol, and **kwargs are means of making your function a bit more generalizable if you'd like to group on other column names. If not, you could also hardcode these into the function body as is done in your question. **kwargs lets you pass any additional keyword arguments to groupby()

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
  • Thanks for the answer Brad, the code works but looks very advanced and considering my level of Python I would prefer to start off with a little bit of a simpler solution. Can I adjust my own code a bit to make it work? In essence: timestamp should be sorted from early to lastest: every minute a user is possible to view a page. Only the latest timestamp of every user is able to get the value: conversion = true, there are no later sessions(timestamps) per user after this. Hence I want the pageviews to sum up cumulative based on timestamp. – julien1337 Oct 26 '18 at 16:57
  • @julien1337 I've added a bit more detail and explanation. Beyond that, good luck! – Brad Solomon Oct 26 '18 at 17:04