
I have a DataFrame like the following:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |NULL    |M      |
|date_4   |NULL    |S      |
+---------+--------+-------+

I want to fill in the NULL id values as below:

+---------+--------+-------+
|date     |id      |typ_mvt|
+---------+--------+-------+
|date_1   |5697    |C      |
|date_2   |5697    |M      |
|date_3   |5697    |M      |
|date_4   |5697    |S      |
+---------+--------+-------+

Is there a way to achieve this?

Thank you for your answers.

Mamaf
    You need to be more specific about the requirement: is the ID always constant? Do you want to fill in 5697 every time there is a null in the column? – Chitral Verma Aug 14 '19 at 10:35
    Minor point: why do you specifically want a UDF-based solution? Is it a requirement of yours, or is any other approach OK with you? – GPI Aug 14 '19 at 10:36

1 Answer


Hello Doc, na.fill does the job nicely:

val rdd = sc.parallelize(Seq(
  (201901, Integer.valueOf(5697), "C"),
  (201902, Integer.valueOf(5697), "M"),
  (201903, null.asInstanceOf[Integer], "M"),
  (201904, null.asInstanceOf[Integer], "S")
))

import spark.implicits._ // needed for toDF and the $"..." column syntax
val df = rdd.toDF("date", "id", "typ_mvt")

// Take the first non-null id and fill every null in the column with it.
val sampleId = df.filter($"id".isNotNull).select($"id").first.getInt(0)
val newDf = df.na.fill(sampleId, Seq("id"))
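For the sample data this replaces both nulls with 5697, so newDf.show() prints roughly:

+------+----+-------+
|  date|  id|typ_mvt|
+------+----+-------+
|201901|5697|      C|
|201902|5697|      M|
|201903|5697|      M|
|201904|5697|      S|
+------+----+-------+

Note that na.fill puts the same value into every null, so this only works because the id is constant across the whole DataFrame (as Chitral Verma's comment points out).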

Otherwise, I found the following very similar post with a much better solution: Fill in null with previously known good value with pyspark
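For reference, here is a minimal Scala sketch of that idea, a forward fill using last with ignoreNulls over a window (filledDf is an illustrative name; an unpartitioned window moves all rows to a single partition, so add a partitionBy on real data):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// Carry the most recent non-null id forward, in date order.
val w = Window.orderBy("date").rowsBetween(Window.unboundedPreceding, 0)
val filledDf = df.withColumn("id", last($"id", ignoreNulls = true).over(w))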

Ali