
I have a DataFrame as below.

Value1 Value2 Value3
30000  40000  50000
null   20000  10000

Also, I have a UDF created as

val testUDF = udf((a: Double, b: Double, c: Double) => {
  if (a == null && b != null && c != null)
    b + c
  else
    a + b + c
})

I have code as below.

input.withColumn("checkNull", testUDF(col("value1"),col("value2"),col("value3"))).show

The resulting DataFrame is:

Value1 Value2 Value3 checkNull
30000  40000  50000  120000
null   20000  10000  null

Here, instead of displaying 30000 in the second row of the "checkNull" column, it displayed null. Is there anything I am doing wrong in my code? I don't want to replace the null with 0, because if I change the addition above to multiplication, filling with 0 would break the result.

KishoreKumar

3 Answers


There are many ways to do what you are trying to do; with the current information, I'd suggest using coalesce:

df.withColumn("x4", 
                   coalesce(
                        $"x1".cast("long") * $"x2" * $"x3", 
                        $"x2".cast("long") * $"x3", 
                        lit(0)
                   )
  ).show
+-----+-----+-----+--------------+
|   x1|   x2|   x3|            x4|
+-----+-----+-----+--------------+
|30000|40000|50000|60000000000000|
| null|20000|10000|     200000000|
+-----+-----+-----+--------------+

I'm casting to long due to this issue: "Why do these two multiplication operations give different results?"
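
To make that overflow concrete (a small standalone sketch of my own, not part of the original answer): 30000 * 40000 * 50000 is about 6 * 10^13, far beyond Int.MaxValue (~2.1 * 10^9), so plain Int arithmetic silently wraps around:

// Int arithmetic wraps past Int.MaxValue, yielding a meaningless negative value;
// widening the first factor to Long keeps the whole product in Long.
val wrapped = 30000 * 40000 * 50000     // overflows
val widened = 30000L * 40000 * 50000    // 60000000000000
println(s"Int: $wrapped, Long: $widened")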

Here is another solution, filling the column x1 with 1 before multiplying:

df.na.fill(1, Seq("x1")).withColumn("x4", $"x1".cast("long") * $"x2" * $"x3").show

It yields the same results.

One other option is to do the following:

df.withColumn("x4", 
                when($"x1".isNull and $"x2".isNotNull and $"x3".isNotNull, $"x2".cast("long") * $"x3")
                .otherwise($"x1".cast("long") * $"x2"* $"x3")
  ).show
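
If you want the question's original addition rather than multiplication, the same coalesce idea carries over; a minimal sketch (same x1/x2/x3 column names assumed):

// If x1 is null, the whole sum x1 + x2 + x3 is null,
// so coalesce falls through to the x2 + x3 expression.
df.withColumn("checkNull",
  coalesce($"x1" + $"x2" + $"x3", $"x2" + $"x3")
).show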

I advise you to read the following entry on Dealing with Null in Apache Spark.

eliasah
  • how would the first part result in `200000000`? Casting a `null` to `long` doesn't do anything, does it? Wouldn't you get a `0` in the second row because the first two parameters to `coalesce` would result in a `null`? – philantrovert Aug 04 '17 at 07:10
  • casting a null into long will yield a null thus the first expression is null. I'm sure of that. – eliasah Aug 04 '17 at 07:12
  • 1
    Oh, I thought it was `$"x1"` in the second parameter as well. My bad. – philantrovert Aug 04 '17 at 07:13
  • No worries @philantrovert – eliasah Aug 04 '17 at 07:14
  • Spark has special NaN semantics, which makes your UDF not work: https://spark.apache.org/docs/latest/sql-programming-guide.html#nan-semantics – eliasah Aug 04 '17 at 07:29

You have to use filter in this situation because when Val1 is null, Spark won't call the UDF at all: the parameters are declared as Scala's primitive Double, which cannot hold null, so Spark short-circuits and returns null (a boxed-type alternative is sketched after the code below).

val testStr = """[{"Val1":1000, "Val2":2000, "Val3":4000},{"Val2":2000, "Val3":4000}]"""
val rdd = sc.parallelize(Seq(testStr))
val df = sqlContext.read.json(rdd)

val newdf = df.filter(df.col("Val1").isNotNull).withColumn("checkNull", df.col("Val1") + df.col("Val2") + df.col("Val3"))
val newdfw2 = df.filter(df.col("Val1").isNull).withColumn("checkNull", df.col("Val2") + df.col("Val3"))
val alldf = newdf.unionAll(newdfw2)
alldf.show()
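
As an alternative to splitting and unioning (a sketch of my own, assuming the same df as above): declare the UDF parameters as the boxed java.lang.Double instead of Scala's primitive Double. The boxed type can hold null, so Spark actually invokes the function and you can handle the null yourself:

import org.apache.spark.sql.functions.udf

// Boxed java.lang.Double accepts nulls, unlike the primitive Scala Double,
// so the null row reaches the function body instead of short-circuiting to null.
val nullSafeUDF = udf((a: java.lang.Double, b: java.lang.Double, c: java.lang.Double) =>
  Seq(a, b, c).filter(_ != null).map(_.doubleValue).sum
)

df.withColumn("checkNull", nullSafeUDF(df.col("Val1"), df.col("Val2"), df.col("Val3"))).show()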

You can simply use na.fill(0.0) to replace all null values with 0.0 and then use the + operator to add the column values:

df.select($"Value1".cast(DoubleType), $"Value2".cast(DoubleType), $"Value3".cast(DoubleType))
    .na.fill(0.0)
    .withColumn("checkNull", $"Value1"+$"Value2"+$"Value3")
    .show(false)

Given dataframe as

+------+------+------+
|Value1|Value2|Value3|
+------+------+------+
|30000 |40000 |50000 |
|null  |20000 |10000 |
+------+------+------+

you should have output as

+-------+-------+-------+---------+
|Value1 |Value2 |Value3 |checkNull|
+-------+-------+-------+---------+
|30000.0|40000.0|50000.0|120000.0 |
|0.0    |20000.0|10000.0|30000.0  |
+-------+-------+-------+---------+

I hope the answer is helpful.

Ramesh Maharjan