I'm creating a DataFrame with this code:
val data = List(
  List(444.1235D),
  List(67.5335D),
  List(69.5335D),
  List(677.5335D),
  List(47.5335D),
  List(null)
)
val rdd = sparkContext.parallelize(data).map(Row.fromSeq(_))
val schema = StructType(Array(
  StructField("value", DataTypes.DoubleType, true)
))
val df = sqlContext.createDataFrame(rdd, schema)
Then I apply my UDF to it:
val multip: Dataset[Double] = df.select(doubleUdf(df("value"))).as[Double]
and then I try to use reduce on this Dataset:
val multipl = multip.reduce(_ * _)
Here I get 0.0 as the result. I also tried to filter the nulls out:
val multipl = multip.filter(_ != null).reduce(_ * _)
with the same result. If I remove the null value from data, everything works as it should. How can I make reduce work correctly when the input contains null values?
My UDF is defined like this:
val doubleUdf: UserDefinedFunction = udf((v: Any) => Try(v.toString.toDouble).toOption)
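For what it's worth, the conversion inside the UDF behaves like this in plain Scala (a minimal sketch of just the UDF body, outside Spark) — the null row comes out as None, because null.toString throws a NullPointerException that Try swallows:

```scala
import scala.util.Try

// Same body as doubleUdf, applied outside Spark
val toDoubleOption: Any => Option[Double] =
  v => Try(v.toString.toDouble).toOption

println(toDoubleOption(444.1235D)) // Some(444.1235)
println(toDoubleOption(null))      // None: null.toString throws an NPE, caught by Try
```

So the null survives the UDF as a None, which Spark writes back as a null in the output column, and .as[Double] then has to decode that null into a primitive Double.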