I have a dataframe with a "date" column of type Timestamp:
+----------+-----+-----+
|      date|title|count|
+----------+-----+-----+
|2002-09-01|text1|    2|
|2003-03-01|text1|    4|
|2007-11-01|text1|    4|
+----------+-----+-----+
and I'd like to fill in all missing dates (in my case, months) between these dates for the same title, initializing each of them with count = 0:
+----------+-----+-----+
|      date|title|count|
+----------+-----+-----+
|2002-09-01|text1|    2|
|2002-10-01|text1|    0|
|2002-11-01|text1|    0|
|2002-12-01|text1|    0|
|2003-01-01|text1|    0|
|2003-02-01|text1|    0|
|2003-03-01|text1|    4|
|2003-04-01|text1|    0|
.........
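For intuition, the fill I'm after can be sketched in plain Python on a small list of (date, title, count) rows. This is just a toy model of the logic, not Spark code; the helper names are mine:

```python
import datetime

def next_month(d):
    # first day of the month following d
    return (d.replace(day=1) + datetime.timedelta(days=32)).replace(day=1)

def fill_missing_months(rows):
    # rows: (date, title, count) tuples for one title, sorted by date
    filled = []
    for i, (d, title, cnt) in enumerate(rows):
        filled.append((d, title, cnt))
        if i + 1 < len(rows):
            m = next_month(d)
            while m < rows[i + 1][0]:  # fill the gap with count = 0
                filled.append((m, title, 0))
                m = next_month(m)
    return filled

rows = [(datetime.date(2002, 9, 1), "text1", 2),
        (datetime.date(2003, 3, 1), "text1", 4)]
print(fill_missing_months(rows))
```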
Should I use the LEAD function from a window expression for that? I actually found exactly what I need in another thread (filling dates in Scala), but I'm not good at Scala and am too new to Spark to adapt it for PySpark. I tried the following, but I'm not sure it's the right way. I would appreciate any help!
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# partition per title so months are filled independently for each title
w = Window.partitionBy("title").orderBy("date")
tmpDF = (df
    .withColumn("next_date", f.lead("date", 1).over(w))
    # Spark >= 2.4: one row per month from date up to (but excluding) next_date
    .withColumn("month", f.explode(f.expr(
        "sequence(to_date(date), coalesce(add_months(to_date(next_date), -1), to_date(date)), interval 1 month)")))
    # only the original row keeps its count; filled-in months get 0
    .withColumn("count", f.when(f.col("month") == f.to_date("date"), f.col("count")).otherwise(f.lit(0)))
    .select(f.col("month").alias("date"), "title", "count"))
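One thing to note: a helper that steps by days won't match monthly data. A month-wise generator (pure Python; the name generate_month_series is mine, and it assumes both dates fall on the 1st) could be wrapped in a UDF and exploded on Spark versions that lack the built-in sequence function:

```python
import datetime

def generate_month_series(start, stop):
    # first day of every month from start to stop, inclusive
    # (assumes both inputs are the 1st of a month)
    months = []
    current = start
    while current <= stop:
        months.append(current)
        # jump past month-end, then snap back to the 1st
        current = (current + datetime.timedelta(days=32)).replace(day=1)
    return months

print(generate_month_series(datetime.date(2002, 9, 1), datetime.date(2003, 1, 1)))
```

In Spark you would register it with f.udf(generate_month_series, ArrayType(DateType())) and explode the resulting column.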