I have a dataframe with values for a certain store in a certain month. It looks like:
df:pyspark.sql.dataframe.DataFrame
TRADEID:integer
time_period:date
VALUE:double
Ideally, for each TRADEID there should be a value for every month of the year; however, for some TRADEIDs some months are skipped.
I need to fill those gaps with the value from the last available month, for example:
Old df
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
New df
+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 28-02-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 28-02-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+
As I understand it, I can use a grouped-map UDF (a "UDAF"), and the process would look like
df.groupby('TRADEID').apply(UDAF)
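A sketch of what I have in mind for the grouped function: the per-group body is plain pandas, so it can be tested without Spark. The function name `fill_months` and the month-end reindexing are my own assumptions, not an existing API.

```python
import pandas as pd

# Hypothetical per-group function: receives all rows of one TRADEID as a
# pandas DataFrame, reindexes onto every month-end in the group's range,
# and forward-fills the missing months from the last available one.
# On the Spark side it would be applied with
#   df.groupby("TRADEID").applyInPandas(fill_months, schema=df.schema)
# (Spark >= 3.0; the older groupby(...).apply(pandas_udf) works the same way).
def fill_months(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.copy()
    # Spark DateType may arrive as datetime.date objects; normalize to Timestamps
    pdf["time_period"] = pd.to_datetime(pdf["time_period"])
    pdf = pdf.sort_values("time_period").set_index("time_period")
    # every month-end between the first and last observed month
    all_months = pd.date_range(pdf.index.min(), pdf.index.max(), freq="M")
    pdf = (pdf.reindex(all_months)   # inserts NaN rows for missing months
              .ffill()               # fill from the last available month
              .rename_axis("time_period")
              .reset_index())
    # reindex turned TRADEID into float (NaN was introduced); restore int
    pdf["TRADEID"] = pdf["TRADEID"].astype("int64")
    return pdf
```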
There is a similar question, but it does not answer the following:
1. Is this the optimal way from a performance point of view? There are lots of TRADEIDs and millions of rows of data.
2. Any advice on how to write the UDF I need with good performance? How do I determine which months are missing? I could create a reference dataframe with all needed months and do an outer join, but what is a nice way to fill values from the last available month into the added months?