I have a PySpark dataframe that keeps track of changes that occur in a product's price and status over months. This means that a new row is created only when a change occurred (in either status or price) compared to the previous month, like in the dummy data below
----------------------------------------
|product_id| status | price| month |
----------------------------------------
|1 | available | 5 | 2019-10|
----------------------------------------
|1 | available | 8 | 2020-08|
----------------------------------------
|1 | limited | 8 | 2020-10|
----------------------------------------
|2 | limited | 1 | 2020-09|
----------------------------------------
|2 | limited | 3 | 2020-10|
----------------------------------------
I would like to create a dataframe that shows the values for each of the last 6 months. This means that I need to duplicate the records whenever there is a gap in the above dataframe. For example, if the last 6 months are 2020-07, 2020-08, ... 2020-12, then the result for the above dataframe should be
----------------------------------------
|product_id| status | price| month |
----------------------------------------
|1 | available | 5 | 2020-07|
----------------------------------------
|1 | available | 8 | 2020-08|
----------------------------------------
|1 | available | 8 | 2020-09|
----------------------------------------
|1 | limited | 8 | 2020-10|
----------------------------------------
|1 | limited | 8 | 2020-11|
----------------------------------------
|1 | limited | 8 | 2020-12|
----------------------------------------
|2 | limited | 1 | 2020-09|
----------------------------------------
|2 | limited | 3 | 2020-10|
----------------------------------------
|2 | limited | 3 | 2020-11|
----------------------------------------
|2 | limited | 3 | 2020-12|
----------------------------------------
Notice that for product_id = 1 there was an older record from 2019-10 that was propagated until 2020-08 and then trimmed, whereas for product_id = 2 there were no records prior to 2020-09 and thus the months 2020-07, 2020-08 were not filled for it (as the product did not exist prior to 2020-09).
Since the dataframe consists of millions of records, a "brute-force" solution using for loops and checking for each product_id is rather slow. It seems that it should be possible to solve this using window functions, by creating another column next_month and then filling in the gaps based on that column, but I don't know how to achieve that.