I am currently tracking monthly counts for users within my product. The issue with this is that rows are missing whenever a user has no activity in a particular month. Here is an example:
Min Month:
+---------------+
|min(year_month)|
+---------------+
| 2019_05|
+---------------+
Max Month:
+---------------+
|max(year_month)|
+---------------+
| 2020_06|
+---------------+
User Data:
+--------------------+----------+----------------------+
| core_id|year_month|month_sum_detailaction|
+--------------------+----------+----------------------+
|000006c9-d42b-4fe...| 2019_09| 3|
|000006c9-d42b-4fe...| 2020_01| 2|
|000006c9-d42b-4fe...| 2020_02| 6|
+--------------------+----------+----------------------+
As you can see, this user has only had activity in three of the 14 months between the min and max month.
What I would like to do is update the data for each user to look something like this:
+--------------------+----------+----------------------+
| core_id|year_month|month_sum_detailaction|
+--------------------+----------+----------------------+
|000006c9-d42b-4fe...| 2019_05| 0|
|000006c9-d42b-4fe...| 2019_06| 0|
|000006c9-d42b-4fe...| 2019_07| 0|
|000006c9-d42b-4fe...| 2019_08| 0|
|000006c9-d42b-4fe...| 2019_09| 3|
|000006c9-d42b-4fe...| 2019_10| 0|
|000006c9-d42b-4fe...| 2019_11| 0|
|000006c9-d42b-4fe...| 2019_12| 0|
|000006c9-d42b-4fe...| 2020_01| 2|
|000006c9-d42b-4fe...| 2020_02| 6|
|000006c9-d42b-4fe...| 2020_03| 0|
|000006c9-d42b-4fe...| 2020_04| 0|
|000006c9-d42b-4fe...| 2020_05| 0|
|000006c9-d42b-4fe...| 2020_06| 0|
+--------------------+----------+----------------------+
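To make the target concrete, here is a minimal sketch of one possible approach, not a confirmed solution. It assumes the user data above is a DataFrame named `df` and that an active `SparkSession` named `spark` exists; both names, and the helpers `month_range` and `fill_missing_months`, are hypothetical. The idea is to build every month label between the min and max month, cross join that list with the distinct users, left join the existing counts, and fill the gaps with 0.

```python
from typing import List


def month_range(min_month: str, max_month: str) -> List[str]:
    """Return every 'YYYY_MM' label from min_month to max_month, inclusive."""
    start_year, start_mon = map(int, min_month.split("_"))
    end_year, end_mon = map(int, max_month.split("_"))
    # Convert to a running month index so crossing a year boundary is plain arithmetic.
    first = start_year * 12 + (start_mon - 1)
    last = end_year * 12 + (end_mon - 1)
    return [f"{i // 12:04d}_{i % 12 + 1:02d}" for i in range(first, last + 1)]


def fill_missing_months(spark, df, min_month, max_month):
    """Add a zero-count row for each (core_id, year_month) pair missing from df.

    Assumes df has the columns core_id, year_month and month_sum_detailaction,
    as in the tables above.
    """
    # Imported here so month_range can be used without Spark installed.
    from pyspark.sql import functions as F

    # One row per month in the full range.
    months = spark.createDataFrame(
        [(m,) for m in month_range(min_month, max_month)], ["year_month"]
    )
    # One row per user.
    users = df.select("core_id").distinct()

    return (
        users.crossJoin(months)  # every user paired with every month
        .join(df, on=["core_id", "year_month"], how="left")  # attach known counts
        .fillna(0, subset=["month_sum_detailaction"])  # months with no activity get 0
        .orderBy(F.col("core_id"), F.col("year_month"))
    )
```

With the min and max shown above, this would be called as `fill_missing_months(spark, df, "2019_05", "2020_06")`. Note that the cross join produces one row per user per month, so the result can be large when there are many users.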
I'm relatively new to PySpark, so any help would be much appreciated.