This code worked under Spark 2.2 with Scala 2.11.x, but fails under Spark 2.4.

val df = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
).toDF("c1", "c2", "c3")

I ran it in Spark 2.4 and it now gives the error:

scala> spark.version
res0: String = 2.4.0

scala> :pa
// Entering paste mode (ctrl-D to finish)

val df = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
).toDF("c1", "c2", "c3")

// Exiting paste mode, now interpreting.

java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._1 AS _1#6
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) AS _2#7
unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._3) AS _3#8
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
  at org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
  at scala.collection.immutable.List.foreach(List.scala:388)
  at scala.collection.TraversableLike.map(TraversableLike.scala:233)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
  at scala.collection.immutable.List.map(List.scala:294)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
  ... 57 elided
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
  ... 66 more

I'm curious what has changed and why replacing the line:

(2, Some(null), Some(2)),

with:

(2, None, Some(2)),

resolves the issue.
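The crux is how `Some` and `Option` treat `null`. A quick sketch in plain Scala (no Spark needed):

```scala
// `Option.apply` maps null to None; the `Some` constructor does not.
val fromOption: Option[String] = Option(null) // None
val fromSome: Option[String] = Some(null)     // Some(null): a non-empty Option wrapping null

// Spark's encoder treats Some(null) as a present value and tries to
// convert the wrapped null (e.g. to UTF8String), which is where the NPE surfaces.
```

So `None` encodes cleanly as a SQL NULL, while `Some(null)` hands the encoder a non-empty option whose payload is `null`.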

What has changed, and what does it mean for existing code bases?

  • I get the NullPointerException in Spark 2.3.0. I don't believe the behaviour has changed (I don't use DataFrames much, but I've been burned by having a `Some(null)` in the past). – hoyland Feb 24 '19 at 11:54
  • `Option(null)` is `None`; `Some(null)` isn't. If you have a Java API that can return `null`, use `Option` instead of `Some`. More info: https://stackoverflow.com/questions/5796616/why-somenull-isnt-considered-none – ariskk Feb 24 '19 at 19:34
  • Works under 2.2 though. @ariskk – thebluephantom Feb 24 '19 at 19:43
  • I just realised I misunderstood your question. Apologies for that. I guess something changed in the `ExpressionEncoder`, maybe [this](https://github.com/apache/spark/commit/c5583fdcd2289559ad98371475eb7288ced9b148#diff-90b107e5c61791e17d5b4b25021b89fd). Could be unintentional. The maintainers should know more. In the meantime, depending on how you generate your DF, something like `Some(null).flatMap(Option(_))` could solve your problem. Sorry again for jumping in. – ariskk Feb 24 '19 at 20:35
  • @ariskk No issue. – thebluephantom Feb 24 '19 at 20:37
  • Regardless of the root cause, I'd report it as an issue in JIRA @ https://issues.apache.org/jira/projects/SPARK as an NPE in such low-level code is certainly a bug and should not be seen by an end user. – Jacek Laskowski Feb 24 '19 at 20:52
  • @JacekLaskowski Maybe I should leave that to you? – thebluephantom Feb 25 '19 at 07:11

1 Answer

Considered a bug and reported as SPARK-26984.
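Until a fix lands, one workaround (ariskk's suggestion from the comments) is to normalise any `Some(null)` to `None` before building the DataFrame. A sketch in plain Scala, using the question's sample data:

```scala
// Normalise option columns: Some(null) becomes None, everything else is unchanged.
// flatMap(Option(_)) works because Option.apply maps null to None.
val raw = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null: String), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
)
val cleaned = raw.map { case (c1, c2, c3) => (c1, c2.flatMap(Option(_)), c3) }
// cleaned(1)._2 is now None, so cleaned.toDF("c1", "c2", "c3") no longer hits the NPE
```

This only touches the `Option[String]` column; apply the same `flatMap(Option(_))` to any other option column that might wrap a `null`.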
