
I have a DataFrame like the following:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |NULL    |M      |
|date_4   |NULL    |S      |
+---------+--------+-------+

I want to fill in the NULL id values as below:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |5697    |M      |
|date_4   |5697    |S      |
+---------+--------+-------+

Is there a way to achieve this?

Thank you for your answers.

Mamaf
    You need to be more specific about the requirement: is the ID always constant? Do you want to fill in 5697 every time there is a null in the column? – Chitral Verma Aug 14 '19 at 10:35
    Minor point: why do you specifically want a UDF-based solution? Is it a requirement of yours, or is any other approach OK with you? – GPI Aug 14 '19 at 10:36

1 Answer


Hello Doc, na.fill does the job nicely:

val rdd = sc.parallelize(Seq(
  (201901, Integer.valueOf(5697), "C"),
  (201902, Integer.valueOf(5697), "M"),
  (201903, null.asInstanceOf[Integer], "M"),
  (201904, null.asInstanceOf[Integer], "S")
))

import spark.implicits._ // needed for toDF and the $"..." column syntax
val df = rdd.toDF("date", "id", "typ_mvt")

// Take the first non-null id and fill every null in the column with it.
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId, Seq("id"))
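For the sample data this replaces both nulls with 5697, so newDf.show() prints roughly:

+------+----+-------+
|  date|  id|typ_mvt|
+------+----+-------+
|201901|5697|      C|
|201902|5697|      M|
|201903|5697|      M|
|201904|5697|      S|
+------+----+-------+

Note that na.fill puts the same value into every null, so this only works because the id is constant across the whole DataFrame (as Chitral Verma's comment points out).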

Otherwise, I found the following very similar post with a much better solution: Fill in null with previously known good value with pyspark
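For reference, here is a minimal Scala sketch of that idea, a forward fill using last with ignoreNulls over a window (filledDf is an illustrative name; an unpartitioned window moves all rows to a single partition, so add a partitionBy on real data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Carry the most recent non-null id forward, in date order.
val w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
val filledDf = df.withColumn("id", last($"id", ignoreNulls = true).over(w))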

Ali