
I have a dataframe with missing values within its rows, and I use df.ffill(axis=1, inplace=True) to forward-fill them in pandas.

I want to understand the equivalent way to achieve this in PySpark. I have read about using Window functions, but those operate down a column, not across a row.

Example:

Input:

id value1 value2 value3 value4 value5
A 2 3 NaN NaN 6
B 1 NaN NaN NaN NaN

Output:

id value1 value2 value3 value4 value5
A 2 3 3 3 6
B 1 1 1 1 1
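
For reference, here is a minimal pandas sketch of what I do today, with the sample data reconstructed from the example above (I restrict the fill to the value columns so id is never used as a fill source):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B'],
    'value1': [2.0, 1.0],
    'value2': [3.0, np.nan],
    'value3': [np.nan, np.nan],
    'value4': [np.nan, np.nan],
    'value5': [6.0, np.nan],
})

# Forward-fill across columns, i.e. left to right within each row
df.loc[:, 'value1':'value5'] = df.loc[:, 'value1':'value5'].ffill(axis=1)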
  • ``df.fillna()``? [Spark-API](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.fillna.html) – JAdel Mar 09 '22 at 10:29
  • You can find your answer in this thread I guess: https://stackoverflow.com/questions/36019847/pyspark-forward-fill-with-last-observation-for-a-dataframe – seghair tarek Mar 09 '22 at 10:49
  • @seghairtarek As you can see, the requirement in that question was to perform forward filling on a column, whereas I need to perform the same on a row instead. Updating my question to add example. – Manish Tripathi Mar 09 '22 at 11:15

1 Answer


You can use coalesce: it will take the value from the value3 column when it is not null, and fall back to value2 otherwise:

from pyspark.sql.functions import coalesce

# coalesce picks the first non-null value among its arguments
df = df.withColumn('value3', coalesce('value3', 'value2'))

To apply this to the whole dataset, loop over the value columns (skipping id) like this:

from pyspark.sql.functions import coalesce

# exclude 'id', otherwise value1 would be filled from the id column
cols = [c for c in df.columns if c != 'id']
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))
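
For completeness, here is a self-contained sketch that rebuilds the sample data from the question and applies the loop above; it assumes an active SparkSession named spark and that the value columns are already in left-to-right fill order:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce

spark = SparkSession.builder.getOrCreate()

# Sample data reconstructed from the question; None stands in for NaN.
# An explicit schema is needed because value3/value4 are null in every row.
df = spark.createDataFrame(
    [('A', 2, 3, None, None, 6),
     ('B', 1, None, None, None, None)],
    'id string, value1 int, value2 int, value3 int, value4 int, value5 int',
)

# Forward-fill left to right, never using 'id' as a fill source
cols = [c for c in df.columns if c != 'id']
for i in range(1, len(cols)):
    df = df.withColumn(cols[i], coalesce(cols[i], cols[i - 1]))

df.show()

Because each iteration builds on the DataFrame produced by the previous one, a run of nulls is filled from the nearest non-null column to its left, matching the expected output (A: 2 3 3 3 6, B: 1 1 1 1 1). Note that coalesce expects its arguments to have compatible types, which is another reason to keep the string id column out of the loop.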