I have a dataframe with values for a certain store in a certain month. It looks like:
df:pyspark.sql.dataframe.DataFrame
TRADEID:integer
time_period:date
VALUE:double
Ideally, for each TRADEID there should be a value for every month of the year; however, for some TRADEIDs some months are skipped.
I need to fill those gaps with the value from the last available month, for example:
Old df
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
New df
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 28-02-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 28-02-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
As I understand it, I can use a grouped-map UDF (a "UDAF"), and the process would look like
df.groupby('TRADEID').apply(UDAF)
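A sketch of what I have in mind for the grouped function: the per-group body is plain pandas, so it can be tested without Spark. The function name `fill_months` and the month-end reindexing are my own assumptions, not an existing API.

```python
import pandas as pd

# Hypothetical per-group function: receives all rows of one TRADEID as a
# pandas DataFrame, reindexes onto every month-end in the group's range,
# and forward-fills the missing months from the last available one.
# On the Spark side it would be applied with
#   df.groupby("TRADEID").applyInPandas(fill_months, schema=df.schema)
# (Spark >= 3.0; the older groupby(...).apply(pandas_udf) works the same way).
def fill_months(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.copy()
    # Spark DateType may arrive as datetime.date objects; normalize to Timestamps
    pdf["time_period"] = pd.to_datetime(pdf["time_period"])
    pdf = pdf.sort_values("time_period").set_index("time_period")
    # every month-end between the first and last observed month
    all_months = pd.date_range(pdf.index.min(), pdf.index.max(), freq="M")
    pdf = (pdf.reindex(all_months)   # inserts NaN rows for missing months
              .ffill()               # fill from the last available month
              .rename_axis("time_period")
              .reset_index())
    # reindex turned TRADEID into float (NaN was introduced); restore int
    pdf["TRADEID"] = pdf["TRADEID"].astype("int64")
    return pdf
```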
There is a similar question, but it does not answer the following:
1. Is this the optimal way from a performance point of view? There are lots of TRADEIDs and millions of rows of data.
2. Any advice on how to write the UDF I need with good performance? How do I determine which months are missing? I could create a reference dataframe with all needed months and do an outer join, but what is a nice way to fill values from the last available month into the added months?