
I have a dataframe with monthly values per store. It looks like

df:pyspark.sql.dataframe.DataFrame

TRADEID:integer
time_period:date
VALUE:double

Ideally, for each TRADEID there should be a value for every month of the year; however, for some TRADEIDs some months are missing.

I need to fill those gaps with the value from the last available month, like so:

Old df

+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+

New df

+-------+-----------+-----+
|TRADEID|time_period|value|
+-------+-----------+-----+
|      1| 31-01-2019|    5|
|      1| 28-02-2019|    5|
|      1| 31-03-2019|    6|
|      2| 31-01-2019|   15|
|      2| 28-02-2019|   15|
|      2| 31-03-2019|   20|
+-------+-----------+-----+

As I understand it, I can use a UDAF, and the process would look like
df.groupBy('TRADEID').apply(UDAF)
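A sketch of that grouped-map idea, with the per-group logic written in pandas (the function name and the month-range handling are my assumptions; on Spark 3.x the grouped-map API is `groupBy(...).applyInPandas(...)`):

```python
import pandas as pd

def fill_months(pdf: pd.DataFrame) -> pd.DataFrame:
    # Runs once per TRADEID group: reindex to every month-end in the
    # group's observed range, then forward-fill from the last available month.
    pdf = pdf.sort_values("time_period").set_index("time_period")
    full_idx = pd.date_range(pdf.index.min(), pdf.index.max(), freq="M")
    pdf = pdf.reindex(full_idx).ffill()
    # reindex turns TRADEID into float for the inserted rows; cast it back
    pdf["TRADEID"] = pdf["TRADEID"].astype("int64")
    pdf.index.name = "time_period"
    return pdf.reset_index()

# On Spark 3.x this would plug in roughly as (schema must match the
# returned frame's columns):
# new_df = df.groupBy("TRADEID").applyInPandas(fill_months, schema=df.schema)
```

Note that each group's whole history is materialized in pandas on one executor, so very long per-TRADEID histories could be a memory concern.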

There is a similar question, but it does not answer the following:
1. Is this the optimal way from a performance point of view? There are lots of TRADEIDs and millions of rows of data.
2. Any advice on how to write the UDAF with good performance? How do I determine which months are missing? I can create a reference dataframe with all needed months and do an outer join, but what is a nice way to fill the added months with the value from the last available month?

kndl
  • Possible duplicate of [Filling gaps in timeseries Spark](https://stackoverflow.com/questions/42411184/filling-gaps-in-timeseries-spark) – pault Aug 16 '19 at 13:38

0 Answers