
Trying to replace null with 0 in a DataFrame using the UDF below. Where could I be going wrong? The code seems straightforward, but it's not working as expected.

I tried to create a UDF that replaces null with 0 in any column whose value is null.

Thank you all in advance.

//imports
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types
import org.apache.spark.sql.types.{StructField, StructType}

object PlayGround {
  def missingValType2(n: Int): Int = {
    if (n == null) {
      0
    } else {
      n
    }
  }

   def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession
      .builder()
      .appName("PlayGround")
      .config("spark.sql.warehouse.dir", "file:///C:/temp")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val missingValUDFType2 = udf[Int, Int](missingValType2)

     val schema = List(
      StructField("name", types.StringType, false),
      StructField("age", types.IntegerType, true)
    )

    val data = Seq(
      Row("miguel", null),
      Row("luisa", 21)
    )
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(data),
      StructType(schema)
    )
    df.show(false)
    df.withColumn("ageNullReplace",missingValUDFType2($"age")).show()

  }
}

/**
  * +------+----+
  * |name  |age |
  * +------+----+
  * |miguel|null|
  * |luisa |21  |
  * +------+----+
  *
  * Below is the current output.
  * +------+----+--------------+
  * |  name| age|ageNullReplace|
  * +------+----+--------------+
  * |miguel|null|          null|
  * | luisa|  21|            21|
  * +------+----+--------------+*/

Expected output:

+------+----+--------------+
|  name| age|ageNullReplace|
+------+----+--------------+
|miguel|null|             0|
| luisa|  21|            21|
+------+----+--------------+
Pavan_Obj
  • Why are you trying to use a UDF? You can just use when within your withColumn to do the same. UDFs are not recommended when a native function can do the same thing. – Aaron May 15 '19 at 20:15
  • Hello @user2315840, thanks for responding; yes, I could have done that. I read the "User Defined Functions" section of https://www.mungingdata.com/apache-spark/dealing-with-null and thought it was not advisable. Let me try your solution quickly. – Pavan_Obj May 15 '19 at 20:21
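The crux hinted at in the comments can be reproduced without Spark: the UDF's parameter is a primitive Int, so its null check can never succeed. A minimal sketch:

```scala
// A Scala Int is a primitive value type and can never actually hold null,
// so the `n == null` check in the question's UDF never matches.
// scalac even warns: comparing values of types Int and Null using `==`
// will always yield false.
val n: Int = 21
val isNull: Boolean = n == null
println(isNull)  // false
```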

2 Answers


There is no need for a UDF. You can apply na.fill to a list of type-specific columns in the DataFrame, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("miguel", None), ("luisa", Some(21))
).toDF("name", "age")

df.na.fill(0, Seq("age")).show
// +------+---+
// |  name|age|
// +------+---+
// |miguel|  0|
// | luisa| 21|
// +------+---+
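As a rough plain-Scala analogy (no Spark needed), na.fill on the "age" column behaves like replacing None with the default while keeping existing values; fillNull below is a hypothetical helper, not a Spark API:

```scala
// Hypothetical helper sketching what na.fill(0, Seq("age")) does per row,
// with None standing in for SQL null.
def fillNull(ages: Seq[Option[Int]], default: Int): Seq[Int] =
  ages.map(_.getOrElse(default))

val ages = Seq(None, Some(21))  // miguel -> null, luisa -> 21
println(fillNull(ages, 0))      // List(0, 21)
```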
Leo C
  • Hello Leo, thank you for that. Does None map to null? I want to replace an age whose value is null with 0, or some random number. – Pavan_Obj May 15 '19 at 20:22
  • Also, I learned a lesson here: https://stackoverflow.com/questions/15777745/how-does-comparison-operator-works-with-null-int/15777768 comparing values of types Int and Null with '!=' or '==' will always yield true in the case of != and false in the case of ==, so if(n != null){ always runs. – Pavan_Obj May 15 '19 at 20:48
  • And I also did something similar the SQL way: spark.sql("select *, case when age is null then '0' else age end as extraColumn from df2").show() – Pavan_Obj May 15 '19 at 21:07
  • The equivalent of SQL's `case when/then/else/end` in Spark is `when/otherwise`, which is what the other answer suggests. Another option is to use `coalesce`, like `df.withColumn("age", coalesce($"age", lit(0)))`. Nevertheless, the proposed `na.fill` approach provides the flexibility of applying the `null` replacement to multiple columns all at once. – Leo C May 15 '19 at 21:33
  • Yes Leo, but when I used the above solution, df.withColumn("ageNullReplace", when(col("age").isNull, functions.lit(0)).otherwise(col("age"))) errors out with "cannot resolve symbol when". – Pavan_Obj May 15 '19 at 22:33
  • The `when` function should be available if you have `import org.apache.spark.sql.functions._`. – Leo C May 15 '19 at 22:43
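The coalesce alternative mentioned in the comment above returns the first non-null argument. Its behavior can be sketched in plain Scala with Option standing in for nullable columns (coalesceLike is a hypothetical helper, not the Spark function):

```scala
// First non-None value wins, mirroring SQL/Spark coalesce semantics.
def coalesceLike[A](values: Option[A]*): Option[A] = values.flatten.headOption

println(coalesceLike(None, Some(0)))      // Some(0)  (null age falls back to 0)
println(coalesceLike(Some(21), Some(0)))  // Some(21) (existing age is kept)
```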

You can use withColumn with a when condition, as in the code below (not tested):

df.withColumn("ageNullReplace", when(col("age").isNull, lit(0)).otherwise(col("age")))

In the above code, note that when without otherwise yields null for rows that don't match the condition, so the otherwise(col("age")) part is what preserves the existing ages.
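A plain-Scala sketch of the when(...).otherwise(...) logic, with None standing in for a null age; it also shows that the otherwise branch is what carries the non-null ages through (an analogy, not Spark code):

```scala
// Mirrors when(col("age").isNull, lit(0)).otherwise(col("age")):
def ageNullReplace(age: Option[Int]): Int = age match {
  case None    => 0  // the when(isNull, 0) branch
  case Some(a) => a  // the otherwise branch keeps the existing age
}

println(Seq(None, Some(21)).map(ageNullReplace))  // List(0, 21)
```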

Hope that helps

Aaron
  • Hey user2315840, I tried exactly that: df.withColumn("ageNullReplace", when(col("age").isNull, functions.lit(0)).otherwise(col("age"))), but the symbol 'when' cannot be resolved. – Pavan_Obj May 15 '19 at 20:36
  • Can you import org.apache.spark.sql.functions._? – Aaron May 16 '19 at 13:44