2

I am currently tracking monthly counts for users within my product. This issue with this is I will be missing rows when a user does not have any activity in a particular month. Here is an example:

Min Month:

+---------------+
|min(year_month)|
+---------------+
|        2019_05|
+---------------+

Max Month:

+---------------+
|max(year_month)|
+---------------+
|        2020_06|
+---------------+

User Data:

+--------------------+----------+----------------------+
|             core_id|year_month|month_sum_detailaction|
+--------------------+----------+----------------------+
|000006c9-d42b-4fe...|   2019_09|                     3|
|000006c9-d42b-4fe...|   2020_01|                     2|
|000006c9-d42b-4fe...|   2020_02|                     6|
+--------------------+----------+----------------------+

As you can see, this user has only had activity in three months of the 12 months.

Would I would like to do is update the data for each user to look something like this:

+--------------------+----------+----------------------+
|             core_id|year_month|month_sum_detailaction|
+--------------------+----------+----------------------+
|000006c9-d42b-4fe...|   2019_05|                     0|
|000006c9-d42b-4fe...|   2020_06|                     0|
|000006c9-d42b-4fe...|   2020_07|                     0|
|000006c9-d42b-4fe...|   2020_08|                     0|
|000006c9-d42b-4fe...|   2019_09|                     3|
|000006c9-d42b-4fe...|   2020_10|                     0|
|000006c9-d42b-4fe...|   2020_11|                     0|
|000006c9-d42b-4fe...|   2019_12|                     0|
|000006c9-d42b-4fe...|   2020_01|                     2|
|000006c9-d42b-4fe...|   2020_02|                     6|
|000006c9-d42b-4fe...|   2020_03|                     0|
|000006c9-d42b-4fe...|   2020_04|                     0|
|000006c9-d42b-4fe...|   2020_05|                     0|
|000006c9-d42b-4fe...|   2020_06|                     0|
+--------------------+----------+----------------------+

I'm relatively new to pyspark so any help would be much appreciated.

Madhav Thaker
  • 360
  • 2
  • 12
  • Something like this could get you started: https://stackoverflow.com/questions/39271374/pyspark-how-to-resample-frequencies – data princess Jun 09 '20 at 19:59
  • Does this answer your question? [Create PySpark dataframe : sequence of months with year](https://stackoverflow.com/questions/57426093/create-pyspark-dataframe-sequence-of-months-with-year) – data princess Jun 09 '20 at 20:14

1 Answers1

4

Try this. You can also collect() max and min year_month and put it in sequence function.

from pyspark.sql import functions as F

df.groupBy("core_id").agg(F.collect_list(F.to_date("year_month","yyyy_MM")).alias("year_month"),\
                          F.collect_list("month_sum_detailaction").alias("month_sum_detailaction"))\
  .withColumn("seq", F.expr("""sequence(to_date('2019_05','yyyy_MM'),to_date('2020_06','yyyy_MM'),interval 1 month)"""))\
  .withColumn("year_month", F.flatten(F.array("year_month", F.array_except("seq","year_month"))))\
  .withColumn("zip", F.explode(F.arrays_zip("year_month"\
                                 ,"month_sum_detailaction")))\
  .select("core_id", F.col("zip.*"))\
  .withColumn("year_month", F.date_format("year_month", "yyyy_MM")).orderBy("year_month").fillna(0).show()

#+--------------------+----------+----------------------+
#|             core_id|year_month|month_sum_detailaction|
#+--------------------+----------+----------------------+
#|000006c9-d42b-4fe...|   2019_05|                     0|
#|000006c9-d42b-4fe...|   2019_06|                     0|
#|000006c9-d42b-4fe...|   2019_07|                     0|
#|000006c9-d42b-4fe...|   2019_08|                     0|
#|000006c9-d42b-4fe...|   2019_09|                     3|
#|000006c9-d42b-4fe...|   2019_10|                     0|
#|000006c9-d42b-4fe...|   2019_11|                     0|
#|000006c9-d42b-4fe...|   2019_12|                     0|
#|000006c9-d42b-4fe...|   2020_01|                     2|
#|000006c9-d42b-4fe...|   2020_02|                     6|
#|000006c9-d42b-4fe...|   2020_03|                     0|
#|000006c9-d42b-4fe...|   2020_04|                     0|
#|000006c9-d42b-4fe...|   2020_05|                     0|
#|000006c9-d42b-4fe...|   2020_06|                     0|
#+--------------------+----------+----------------------+
murtihash
  • 8,030
  • 1
  • 14
  • 26