I have the following challenge. I have the following dataframe called: defined_conversions
user_id pageviews conversion timestamp
1 3 True 08:01:12
1 4 False 07:02:14
1 7 False 08:02:14
2 2 True 10:12:15
2 2 False 05:12:18
What I want to achieve is add an additional column called sum_pageviews that takes the cumulative sum of the pageviews per user.
I built this function to achieve this:
def pageviews_per_user(defined_conversions):
defined_conversions['sum_pageviews'] = defined_conversions.groupby(['user_id'])['pageviews'].cumsum
return defined_conversions
What I am worried about is that dataframe will look like this:
user_id pageviews conversion timestamp sum_pageviews
1 3 True 08:01:12 14
1 4 False 07:02:14 14
1 7 False 08:02:14 14
2 2 True 10:12:15 4
2 2 False 05:12:18 4
I want it to look like:
user_id pageviews conversion timestamp sum_pageviews
1 3 True 08:01:12 3
1 4 False 07:02:14 7
1 7 False 08:02:14 14
2 2 True 10:12:15 2
2 2 False 05:12:18 4
So in essence the pageviews should sum cumulative following the timestamp. Should I sort my data on timestamp first before running the cumsum formula? Or should I do something else?
ps: I am a beginner with python/pandas
Thanks in advance!