Pyspark Coalesce with first non null and most recent nonnull values

Question

I have a dataframe with a column has null sofr first few and last few rows. How do I coalesce this column using the first non-null value and the last non-null record?

For example say I have the following dataframe:

What'd I'd want to produce is the following:

So as you can see the first two rows get populated with 0.6 because that is the first non-null record. The last several rows become 3 because that was the last non-null record.

AFAIK in this context `coalesce` refers to merging two or more _columns_ filling the null-values of the first column with the values of the second column. What you are asking for, I would call "foward filling" and "backward filling". — Til Piffl, Jan 19 '22 at 15:31
Yes, a bit nerdy :) I just mentioned it, because with such search terms you would find related other questions/answers like this one: https://stackoverflow.com/a/64627448/5118843 — Til Piffl, Jan 19 '22 at 15:56

score 1 · Accepted Answer · answered Jan 19 '22 at 15:22

You can use last() for filling and Window for sorting:

from pyspark.sql import Row, Window, functions as F

df = sql_context.createDataFrame([
    Row(Month=datetime.date(2021,1,1), Rating=None),
    Row(Month=datetime.date(2021,2,1), Rating=None),
    Row(Month=datetime.date(2021,3,1), Rating=0.6),
    Row(Month=datetime.date(2021,4,1), Rating=1.2),
    Row(Month=datetime.date(2021,5,1), Rating=1.),
    Row(Month=datetime.date(2021,6,1), Rating=None),
    Row(Month=datetime.date(2021,7,1), Rating=None),
])

(
    df
    .withColumn('Rating', 
                F.when(F.isnull('Rating'),
                       F.last('Rating', ignorenulls=True).over(Window.orderBy('Month'))
                ).otherwise(F.col('Rating')))
    # This second run below is only required for the first rows in the DF
    .withColumn('Rating',
                F.when(F.isnull('Rating'),
                       F.last('Rating', ignorenulls=True).over(Window.orderBy(F.desc('Month')))
                ).otherwise(F.col('Rating')))
    .sort('Month') # Only required for output formatting
    .show()
)
# Output
+----------+------+
|     Month|Rating|
+----------+------+
|2021-01-01|   0.6|
|2021-02-01|   0.6|
|2021-03-01|   0.6|
|2021-04-01|   1.2|
|2021-05-01|   1.0|
|2021-06-01|   1.0|
|2021-07-01|   1.0|
+----------+------+

Pyspark Coalesce with first non null and most recent nonnull values

1 Answers1