
I have a dataframe with missing values within its rows, and I use df.ffill(axis=1, inplace=True) to forward-fill them in pandas.

I want to understand the equivalent way to achieve this in PySpark. I have read about using Window functions, but those operate down a column, not across a row.

Example:

Input:

id value1 value2 value3 value4 value5
A 2 3 NaN NaN 6
B 1 NaN NaN NaN NaN

Output:

id value1 value2 value3 value4 value5
A 2 3 3 3 6
B 1 1 1 1 1
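
For reference, here is a minimal pandas sketch of what I do today, with the sample data reconstructed from the example above (I restrict the fill to the value columns so id is never used as a fill source):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B'],
    'value1': [2.0, 1.0],
    'value2': [3.0, np.nan],
    'value3': [np.nan, np.nan],
    'value4': [np.nan, np.nan],
    'value5': [6.0, np.nan],
})

# Forward-fill across columns, i.e. left to right within each row
df.loc[:, 'value1':'value5'] = df.loc[:, 'value1':'value5'].ffill(axis=1)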
  • ``df.fillna()``? [Spark-API](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.fillna.html) – JAdel Mar 09 '22 at 10:29
  • You can find your answer in this thread I guess: https://stackoverflow.com/questions/36019847/pyspark-forward-fill-with-last-observation-for-a-dataframe – seghair tarek Mar 09 '22 at 10:49
  • @seghairtarek As you can see, the requirement in that question was to perform forward filling on a column, whereas I need to perform the same on a row instead. Updating my question to add example. – Manish Tripathi Mar 09 '22 at 11:15

1 Answer


You can use coalesce: it will take the value from the value3 column when it is not null, and fall back to value2 otherwise:

from pyspark.sql.functions import coalesce

# coalesce picks the first non-null value among its arguments
df = df.withColumn('value3', coalesce('value3', 'value2'))

To apply this to the whole dataset, loop over the value columns (skipping id) like this:

from pyspark.sql.functions import coalesce

# exclude 'id', otherwise value1 would be filled from the id column
cols = [c for c in df.columns if c != 'id']
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
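
For completeness, here is a self-contained sketch that rebuilds the sample data from the question and applies the loop above; it assumes an active SparkSession named spark and that the value columns are already in left-to-right fill order:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question; None stands in for NaN.
# An explicit schema is needed because value3/value4 are null in every row.
df = spark.createDataFrame(
    [('A', 2, 3, None, None, 6),
     ('B', 1, None, None, None, None)],
    'id string, value1 int, value2 int, value3 int, value4 int, value5 int',
)

# Forward-fill left to right, never using 'id' as a fill source
cols = [c for c in df.columns if c != 'id']
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))

df.show()

Because each iteration builds on the DataFrame produced by the previous one, a run of nulls is filled from the nearest non-null column to its left, matching the expected output (A: 2 3 3 3 6, B: 1 1 1 1 1). Note that coalesce expects its arguments to have compatible types, which is another reason to keep the string id column out of the loop.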