Say I have a dataframe as follows, representing accounts who are subscribed to a particular service, the date the subscription began, and the date the subscription ended (or null, if it hasn't ended yet).
ACCOUNT_ID START_DATE END_DATE
1 2018-05-01 2018-09-07
2 2018-07-20 2018-08-10
3 2018-06-10 null
My goal is to generate a table that belows, that has, for each month, how many people were subscribed to the service on the first day of that month.
DATE SUBSCRIBER_COUNT
2018-01-01 0
2018-02-01 0
2018-03-01 0
2018-04-01 0
2018-05-01 1
2018-06-01 2
2018-07-01 3
2018-08-01 3
2018-09-01 2
2018-10-01 1
2018-11-01 1
Approaches I have tried:
- Somehow explode the dataframe such that for each account, I have multiple rows, one row for each day the account has been a subscriber. Then filter that to the days that I am interested in (i.e., first day of each month), then groupby day and count the accounts.
- Not sure how to generate such an exploded dataset without using nasty UDFs.
- Generate a list of the days that I am interested in; for each one, add a new boolean column to the original dataframe named
subscribed_yyyymmdd
by doingstart < day && day < end
. Then boolcount these columns to get the subscriber counts.- It's clunky to have to add new columns, and also doesn't scale well if instead I wanted to do this daily not monthly