PySpark - Row count based for day on start/end dates

Question

Say I have a dataframe as follows, representing accounts who are subscribed to a particular service, the date the subscription began, and the date the subscription ended (or null, if it hasn't ended yet).

ACCOUNT_ID   START_DATE   END_DATE
1            2018-05-01   2018-09-07
2            2018-07-20   2018-08-10
3            2018-06-10   null

My goal is to generate a table that belows, that has, for each month, how many people were subscribed to the service on the first day of that month.

DATE        SUBSCRIBER_COUNT
2018-01-01    0
2018-02-01    0
2018-03-01    0
2018-04-01    0
2018-05-01    1
2018-06-01    2
2018-07-01    3
2018-08-01    3
2018-09-01    2
2018-10-01    1
2018-11-01    1

Approaches I have tried:

Somehow explode the dataframe such that for each account, I have multiple rows, one row for each day the account has been a subscriber. Then filter that to the days that I am interested in (i.e., first day of each month), then groupby day and count the accounts.
- Not sure how to generate such an exploded dataset without using nasty UDFs.
Generate a list of the days that I am interested in; for each one, add a new boolean column to the original dataframe named subscribed_yyyymmdd by doing start < day && day < end. Then boolcount these columns to get the subscriber counts.
- It's clunky to have to add new columns, and also doesn't scale well if instead I wanted to do this daily not monthly

Possible duplicate of [Count the number of sessions if the beginning and end of each session is known](https://stackoverflow.com/questions/52673274/count-the-number-of-sessions-if-the-beginning-and-end-of-each-session-is-known) — cronoik, Nov 01 '18 at 22:27

PySpark - Row count based for day on start/end dates

0 Answers0