I have a dataframe with a "date" column of type Timestamp:

+------------+--------------+-----+
|   date     |         title|count|
+------------+--------------+-----+
|  2002-09-01|text1         |    2|
|  2003-03-01|text1         |    4|
|  2007-11-01|text1         |    4|

and I'd like to fill in all missing dates (months, in my case) between these dates for the same title, initializing each of them with count = 0:

+------------+--------------+-----+
|   date     |         title|count|
+------------+--------------+-----+
|  2002-09-01|text1         |    2|
|  2002-10-01|text1         |    0|
|  2002-11-01|text1         |    0|
|  2002-12-01|text1         |    0|
|  2003-01-01|text1         |    0|
|  2003-02-01|text1         |    0|
|  2003-03-01|text1         |    4|
|  2003-04-01|text1         |    0|
.........

Should I use the LEAD function from a Window expression for that? I actually found exactly what I need in another thread (filling dates in Scala), but I'm not good enough at Scala and too new to Spark to adapt it to PySpark. I tried the snippet below, but I'm not sure it's the right way. I'd appreciate any help!

import datetime

from pyspark.sql import Window
from pyspark.sql import functions as f
from pyspark.sql.functions import col, datediff, explode

def generate_date_series(start, stop):
    return [start + datetime.timedelta(days=x) for x in range(0, (stop - start).days + 1)]

w = Window.orderBy("date")
tmpDF = df.withColumn("diff", datediff(f.lead(col("date"), 1).over(w), col("date"))) \
    .filter(col("diff") > 31) \
    .withColumn("date", generate_date_series(explode(....)))
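Note that my `generate_date_series` steps by days, while the gaps I want to fill are whole months. A monthly variant in plain Python might look like this (a sketch: the function name mirrors mine, and it assumes both endpoints are first-of-month dates):

```python
import datetime

def generate_month_series(start, stop):
    """Return first-of-month dates from start to stop inclusive.

    Assumes start and stop are both datetime.date values falling on
    the first day of a month, as in the dataframe above.
    """
    # total number of month steps between the two endpoints
    n_months = (stop.year - start.year) * 12 + (stop.month - start.month)
    return [
        datetime.date(
            start.year + (start.month - 1 + m) // 12,   # carry years on overflow
            (start.month - 1 + m) % 12 + 1,             # wrap months back to 1..12
            1,
        )
        for m in range(n_months + 1)
    ]
```

A plain Python function like this can't be called directly inside `withColumn` as in my attempt above; it would need to be registered as a UDF returning `array<date>` and then exploded.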
  • Please check https://stackoverflow.com/questions/42411184/filling-gaps-in-timeseries-spark – frb Feb 20 '19 at 15:59
