This code worked under Spark 2.2 with Scala 2.11.x, but fails under Spark 2.4.

val df = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
).toDF("c1", "c2", "c3")

I ran it in Spark 2.4 and it now gives the error:

scala> spark.version
res0: String = 2.4.0

scala> :pa
// Entering paste mode (ctrl-D to finish)

val df = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
).toDF("c1", "c2", "c3")

// Exiting paste mode, now interpreting.

java.lang.RuntimeException: Error while encoding: java.lang.NullPointerException
assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._1 AS _1#6
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, unwrapoption(ObjectType(class java.lang.String), assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._2), true, false) AS _2#7
unwrapoption(IntegerType, assertnotnull(assertnotnull(input[0, scala.Tuple3, true]))._3) AS _3#8
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:293)
  at org.apache.spark.sql.SparkSession.$anonfun$createDataset$1(SparkSession.scala:472)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
  at scala.collection.immutable.List.foreach(List.scala:388)
  at scala.collection.TraversableLike.map(TraversableLike.scala:233)
  at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
  at scala.collection.immutable.List.map(List.scala:294)
  at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:472)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:377)
  at org.apache.spark.sql.SQLImplicits.localSeqToDatasetHolder(SQLImplicits.scala:228)
  ... 57 elided
Caused by: java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:289)
  ... 66 more

I'm curious what has changed and why replacing the line:

(2, Some(null), Some(2)),

with:

(2, None, Some(2)),

resolves the issue.
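The crux is how `Some` and `Option` treat `null`. A quick sketch in plain Scala (no Spark needed):

```scala
// `Option.apply` maps null to None; the `Some` constructor does not.
val fromOption: Option[String] = Option(null) // None
val fromSome: Option[String] = Some(null)     // Some(null): a non-empty Option wrapping null

// Spark's encoder treats Some(null) as a present value and tries to
// convert the wrapped null (e.g. to UTF8String), which is where the NPE surfaces.
```

So `None` encodes cleanly as a SQL NULL, while `Some(null)` hands the encoder a non-empty option whose payload is `null`.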

What has changed, and what does it mean for existing code bases?

  • I get the NullPointerException in Spark 2.3.0. I don't believe the behaviour has changed (I don't use DataFrames much, but I've been burned by having a `Some(null)` in the past). – hoyland Feb 24 '19 at 11:54
  • `Option(null)` is `None`; `Some(null)` isn't. If you have a Java API that can return `null`, use `Option` instead of `Some`. More info: https://stackoverflow.com/questions/5796616/why-somenull-isnt-considered-none – ariskk Feb 24 '19 at 19:34
  • Works under 2.2 though. @ariskk – thebluephantom Feb 24 '19 at 19:43
  • I just realised I misunderstood your question. Apologies for that. I guess something changed in the `ExpressionEncoder`, maybe [this](https://github.com/apache/spark/commit/c5583fdcd2289559ad98371475eb7288ced9b148#diff-90b107e5c61791e17d5b4b25021b89fd). Could be unintentional. The maintainers should know more. In the meantime, depending on how you generate your DF, something like `Some(null).flatMap(Option(_))` could solve your problem. Sorry again for jumping in. – ariskk Feb 24 '19 at 20:35
  • @ariskk No issue. – thebluephantom Feb 24 '19 at 20:37
  • Regardless of the root cause, I'd report it as an issue in JIRA @ https://issues.apache.org/jira/projects/SPARK as an NPE in such low-level code is certainly a bug and should not be seen by an end user. – Jacek Laskowski Feb 24 '19 at 20:52
  • @JacekLaskowski Maybe I should leave that to you? – thebluephantom Feb 25 '19 at 07:11

1 Answer

Considered a bug and reported as SPARK-26984.
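Until a fix lands, one workaround (ariskk's suggestion from the comments) is to normalise any `Some(null)` to `None` before building the DataFrame. A sketch in plain Scala, using the question's sample data:

```scala
// Normalise option columns: Some(null) becomes None, everything else is unchanged.
// flatMap(Option(_)) works because Option.apply maps null to None.
val raw = Seq(
  (1, Some("a"), Some(1)),
  (2, Some(null: String), Some(2)),
  (3, Some("c"), Some(3)),
  (4, None, None)
)
val cleaned = raw.map { case (c1, c2, c3) => (c1, c2.flatMap(Option(_)), c3) }
// cleaned(1)._2 is now None, so cleaned.toDF("c1", "c2", "c3") no longer hits the NPE
```

This only touches the `Option[String]` column; apply the same `flatMap(Option(_))` to any other option column that might wrap a `null`.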
