I have a dataframe with a "date" column of type Timestamp:

+------------+--------------+-----+
|   date     |         title|count|
+------------+--------------+-----+
|  2002-09-01|text1         |    2|
|  2003-03-01|text1         |    4|
|  2007-11-01|text1         |    4|

and I'd like to fill in all missing dates (months, in my case) between these dates for the same title, initializing each of them with count = 0:

+------------+--------------+-----+
|   date     |         title|count|
+------------+--------------+-----+
|  2002-09-01|text1         |    2|
|  2002-10-01|text1         |    0|
|  2002-11-01|text1         |    0|
|  2002-12-01|text1         |    0|
|  2003-01-01|text1         |    0|
|  2003-02-01|text1         |    0|
|  2003-03-01|text1         |    4|
|  2003-04-01|text1         |    0|
.........

Should I use the LEAD function from a Window expression for that? I actually found exactly what I need in another thread (filling dates in Scala), but I'm not good enough at Scala and too new to Spark to adapt it to PySpark. I tried the snippet below, but I'm not sure it's the right way. I'd appreciate any help!

import datetime

from pyspark.sql import Window
from pyspark.sql import functions as f
from pyspark.sql.functions import col, datediff, explode

def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

w = Window.orderBy("date")
tmpDF = df.withColumn("diff", datediff(f.lead(col("date"), 1).over(w), col("date"))) \
    .filter(col("diff") > 31) \
    .withColumn("date", generate_date_series(explode(....)))
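Note that my `generate_date_series` steps by days, while the gaps I want to fill are whole months. A monthly variant in plain Python might look like this (a sketch: the function name mirrors mine, and it assumes both endpoints are first-of-month dates):

```python
import datetime

def generate_month_series(start, stop):
    """Return first-of-month dates from start to stop inclusive.

    Assumes start and stop are both datetime.date values falling on
    the first day of a month, as in the dataframe above.
    """
    # total number of month steps between the two endpoints
    n_months = (stop.year - start.year) * 12 + (stop.month - start.month)
    return [
        datetime.date(
            start.year + (start.month - 1 + m) // 12,   # carry years on overflow
            (start.month - 1 + m) % 12 + 1,             # wrap months back to 1..12
            1,
        )
        for m in range(n_months + 1)
    ]
```

A plain Python function like this can't be called directly inside `withColumn` as in my attempt above; it would need to be registered as a UDF returning `array<date>` and then exploded.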
  • Please check https://stackoverflow.com/questions/42411184/filling-gaps-in-timeseries-spark – frb Feb 20 '19 at 15:59
