I have a table that contains users' subscriptions to a service.
Whenever a user's status changes, a new row is inserted into the table with the same user_id, the time of the update, and the new status.
Example table:
import pandas as pd

example = {'updated_at': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-02', '2017-01-03', '2017-01-03', '2017-01-03', '2017-01-04', '2017-01-04'],
           'user_id': [9, 10, 11, 12, 13, 12, 14, 11, 10],
           'status': ['trial', 'cancelled', 'trial', 'trial', 'trial', 'active', 'active', 'active', 'active']}
example_data = pd.DataFrame.from_dict(example)
What I'm trying to build is a table that shows the number of users in each status for every day. If nothing changes from one day to the next, the table should show the same counts on both days.
If a user's status changes, then from that day onward the counts should reflect the new breakdown by status.
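To make the rule concrete, here is a minimal sketch of what I mean for a single day. It assumes an update takes effect on the day it is recorded, which is what the desired outcome suggests. The mini-log and the names `log` and `status_counts_on` are made up for illustration and are not part of my real data.

```python
import pandas as pd

# Hypothetical mini-log: user 1 starts on trial, then becomes active.
log = pd.DataFrame({
    'updated_at': ['2017-01-01', '2017-01-01', '2017-01-03'],
    'user_id': [1, 2, 1],
    'status': ['trial', 'trial', 'active'],
})


def status_counts_on(log, day):
    """Count users per status, using each user's latest update on or before `day`."""
    # ISO-formatted date strings compare in chronological order.
    seen = log[log['updated_at'] <= day]
    # A stable sort keeps the original row order for updates made on the same day.
    ordered = seen.sort_values('updated_at', kind='mergesort')
    # Keep only the most recent row per user, so nobody is counted twice.
    latest = ordered.drop_duplicates('user_id', keep='last')
    return latest['status'].value_counts().to_dict()


print(status_counts_on(log, '2017-01-02'))  # both users still on trial
print(status_counts_on(log, '2017-01-03'))  # user 1 has moved to active
```

On 2017-01-02 nothing has changed, so both users are still counted as trial. On 2017-01-03 user 1 is counted once, as active, rather than once per row.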
Example desired outcome:
print(desired_results)
          date  n     status
0   2017-01-01  0     active
1   2017-01-01  1  cancelled
2   2017-01-01  1      trial
3   2017-01-02  0     active
4   2017-01-02  1  cancelled
5   2017-01-02  3      trial
6   2017-01-03  2     active
7   2017-01-03  1  cancelled
8   2017-01-03  3      trial
9   2017-01-04  4     active
10  2017-01-04  0  cancelled
11  2017-01-04  2      trial
What I have tried so far:
(example_data
 .groupby(['updated_at', 'status'])  # group by date and status
 .count()                            # count rows in each group
 .unstack(fill_value=0)              # unstack status into columns
 .cumsum()                           # cumulative sum for each status
 .stack()                            # stack everything back together
 .reset_index()                      # turn the index back into columns
 .rename(columns={'user_id': 'n'}))  # rename the count column
but this does not give me what I was hoping for. The problem is that every update is counted as a new record, so users whose status changes are double counted.
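One direction I have been considering, as a rough sketch rather than a finished solution: pivot to one column per user, forward-fill each user's last known status across days, and only then count statuses per day. This assumes that rows are already in chronological order and that an update counts from the day it is recorded. The intermediate names (`per_user`, `counts`, `result`) are my own.

```python
import pandas as pd

example_data = pd.DataFrame({
    'updated_at': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-02',
                   '2017-01-03', '2017-01-03', '2017-01-03',
                   '2017-01-04', '2017-01-04'],
    'user_id': [9, 10, 11, 12, 13, 12, 14, 11, 10],
    'status': ['trial', 'cancelled', 'trial', 'trial', 'trial',
               'active', 'active', 'active', 'active'],
})

df = example_data.assign(updated_at=pd.to_datetime(example_data['updated_at']))

# If a user changes status more than once on a day, keep only the last change.
latest = df.drop_duplicates(['updated_at', 'user_id'], keep='last')

# One row per day, one column per user; NaN where the user had no update that day.
per_user = latest.pivot(index='updated_at', columns='user_id', values='status')

# Add any calendar days without updates, then carry each user's status forward.
all_days = pd.date_range(per_user.index.min(), per_user.index.max(), freq='D')
per_user = per_user.reindex(all_days).ffill()

# Count users per status per day; users not yet seen are NaN and match nothing.
statuses = sorted(df['status'].unique())
counts = pd.DataFrame({s: (per_user == s).sum(axis=1) for s in statuses})

# Back to the long date / n / status layout.
result = (counts.rename_axis('date')
                .reset_index()
                .melt(id_vars='date', var_name='status', value_name='n')
                .sort_values(['date', 'status'])
                .reset_index(drop=True)[['date', 'n', 'status']])
print(result)
```

Because each user occupies exactly one column, a user can contribute to only one status per day. On the sample data this gives 2 active, 1 cancelled and 3 trial on 2017-01-03 (trial being users 9, 11 and 13), and the counts for any day never add up to more than the six distinct users.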
Can anyone help with this?