
I have a table which contains users' subscriptions to a service.

When there is a change in a user's status, a new row is inserted into the table with the same user_id, the time of the update and the new status.

Example table:

import pandas as pd

example = {'updated_at': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-02',
                          '2017-01-03', '2017-01-03', '2017-01-03', '2017-01-04',
                          '2017-01-04'],
           'user_id': [9, 10, 11, 12, 13, 12, 14, 11, 10],
           'status': ['trial', 'cancelled', 'trial', 'trial', 'trial', 'active',
                      'active', 'active', 'active']}

example_data = pd.DataFrame.from_dict(example)

What I'm trying to build is a table which shows the count of users grouped by status for every day. If nothing changes from one day to the next, the table should show the same counts on both days.

If one of the users updates their status, then from the day after the change the counts should reflect the new breakdown of statuses.

Example desired outcome:

print desired_results
          date  n     status
0   2017-01-01  0     active
1   2017-01-01  1  cancelled
2   2017-01-01  1      trial
3   2017-01-02  0     active
4   2017-01-02  1  cancelled
5   2017-01-02  3      trial
6   2017-01-03  2     active
7   2017-01-03  1  cancelled
8   2017-01-03  4      trial
9   2017-01-04  4     active
10  2017-01-04  0  cancelled
11  2017-01-04  2      trial

What I have tried so far:

(example_data.
 groupby(['updated_at', 'status']).  # group by date and status
 count().                            # count rows in each group
 unstack(fill_value=0).              # one column per status
 cumsum().                           # running total for each status
 stack().                            # back to long format
 reset_index().rename(columns={'user_id': 'n'}))  # rename the count column

but this does not give me what I was hoping for. The problem is that I'm counting updates as new records, and therefore users who change status are double counted.
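To see the double counting concretely: by 2017-01-04 the cumulative totals add up to 9, one per row of the table, even though there are only 6 distinct users, because users 10, 11 and 12 appear under both their old and new statuses. A quick check (the same chain as above, just collapsed):

counts = (example_data.
          groupby(['updated_at', 'status'])['user_id'].count().
          unstack(fill_value=0).
          cumsum())
print(counts.loc['2017-01-04'].sum())     # 9 status updates so far
print(example_data['user_id'].nunique())  # but only 6 distinct users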

Can anyone help with this?

Giacomo

2 Answers


I'm not really sure what you want, but it seems like this would work:

(example_data.
 groupby(['updated_at', 'status']).  # group by date and status
 count().                            # count rows in each group
 unstack(fill_value=0).              # one column per status
 stack().                            # back to long format
 reset_index().rename(columns={'user_id': 'n'}))  # rename the count column
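Note that without carrying statuses forward, this only reports the updates recorded on each individual day. A quick check against the example data (a sketch, reusing example_data from the question):

daily = (example_data.
         groupby(['updated_at', 'status'])['user_id'].count().
         unstack(fill_value=0).
         stack())
print(daily.loc['2017-01-04'])
# active       2
# cancelled    0
# trial        0

The desired output for 2017-01-04 is 4 active, 0 cancelled and 2 on trial, so earlier statuses need to be carried forward somehow.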
Dionys
  • Hi Dionys, thanks, but this is not giving the expected results. I have printed desired_results in the question to show what I want. Which part is not clear? – Giacomo Oct 04 '17 at 08:17

I realised that what I actually had to do was simply to carry forward each user's previous observation for every date, and then do a simple groupby and count.

I found this answer on SO about how to reindex dates within a groupby, which put me on the right track.

My solution looks like this:

def reindex_by_date(df):
    # Index each user's rows by date, then forward-fill the last known status.
    dates = ['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04']
    return df.set_index('updated_at').reindex(dates, method='ffill')

(example_data.
 groupby('user_id').
 apply(reindex_by_date).      # carry each user's status forward over all dates
 reset_index(0, drop=True).   # drop the user_id group level
 reset_index().               # turn the date index back into a column
 groupby(['status', 'updated_at']).  # NaN statuses (before a user's first record) are dropped
 count().
 reset_index().
 sort_values(['updated_at', 'status']).
 rename(columns={'updated_at': 'date', 'user_id': 'n'})
)

The above seems a bit redundant, there is probably a cleaner way to do it.
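One possibly cleaner variant of the same carry-forward idea, sketched under the assumption that every date of interest already appears in updated_at (otherwise reindex the pivoted frame against a full pd.date_range first): pivot to one column per user, forward-fill, then count statuses per day.

wide = (example_data.
        pivot(index='updated_at', columns='user_id', values='status').
        ffill())  # one column per user, last known status carried forward

result = (wide.
          apply(pd.Series.value_counts, axis=1).  # status counts per day
          fillna(0).astype(int).
          rename_axis('date').                    # name the date index
          rename_axis('status', axis=1).          # name the status columns
          stack().
          reset_index(name='n').
          sort_values(['date', 'status']))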

Giacomo
  • ok, I will accept my own answer as soon as I am allowed to and unless someone else replies with a cleaner solution. :) – Giacomo Oct 05 '17 at 13:09