I have a PySpark dataframe with a non-unique key column `key` and two other columns, `number` and `value`.
For most keys, the `number` column goes from 1 to 12, but for some of them there are gaps (for example, only the numbers [1, 2, 5, 9] are present). I would like to add the missing rows, so that for every `key` all the numbers in the range 1-12 are present, each populated with the last seen value.
For example, for the table

    key  number  value
    a    1       6
    a    2       10
    a    5       20
    a    9       25

I would like to get

    key  number  value
    a    1       6
    a    2       10
    a    3       10
    a    4       10
    a    5       20
    a    6       20
    a    7       20
    a    8       20
    a    9       25
    a    10      25
    a    11      25
    a    12      25

I thought about creating a table with each key (a in the example) and an array of 1-12, exploding the array and joining it with my original table, then separately populating the `value` column with the previous value using a window function bounded by the current row. However, this seems a bit inelegant, and I wonder whether there is a better way to achieve what I want.