
When importing data from an MS SQL database, there is the potential for null values. In Spark, a DataFrame can handle null values, but when I try to convert the DataFrame to a strongly typed Dataset, I get encoder errors.

Here's a simple example:

import org.apache.spark.sql.SparkSession

case class optionTest(var a: Option[Int], var b: Option[Int])

object testObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df2 = Seq((1, 3), (3, Option(null)))
                 .toDF("a", "b")
                 .as[optionTest]

    df2.show()
  }
}

Here is the error for this case:

java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"

What is the recommended approach to handle nullable values when creating a Dataset from a DataFrame?

terminatur

2 Answers


The problem is that your DataFrame doesn't match your case class.

Your first pair is an (Int, Int), and your second is an (Int, Option[Null]).

The easy thing to note is that if you want to represent an Option[Int], the value will be either Some(3), for example, or None for an absent value.

The tricky thing to note is that in Scala, Int is a subclass of AnyVal, while nullable references (which should be almost nonexistent in the Scala code you write) sit on the AnyRef side of the Scala object hierarchy.

Because your tuples mix values from both sides of that hierarchy, the only common type the compiler can infer for the second element is Any, the superclass of everything, and there is no Encoder that can handle Any.
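
You can see the inference go wrong without Spark at all; this is just the Scala compiler computing a least upper bound:

// Int sits on the AnyVal side, Option[Null] on the AnyRef side,
// so their only common supertype is Any:
val rows = Seq((1, 3), (3, Option(null)))   // inferred as Seq[(Int, Any)] -- no Encoder exists for Any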

So with all that said, your data would have to look like this:

val df2 = Seq((Some(1), Some(3)), (Some(3), None))

As a side note, your case class should look like this:

case class OptionTest(a: Option[Int], b: Option[Int])
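
Put together, a minimal sketch of the working version (this assumes a SparkSession named spark is in scope; the explicit Seq type ascription is only there to make the element type unambiguous):

import spark.implicits._

val ds = Seq[(Option[Int], Option[Int])]((Some(1), Some(3)), (Some(3), None))
  .toDF("a", "b")
  .as[OptionTest]

ds.show()   // the row with b = None shows up as null in the output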

Vidya
  • This is helpful, but how then would you handle a null value in a DataFrame when going to a Dataset? – terminatur Mar 31 '17 at 16:45
  • Funny you should ask. I [answered a similar question](http://stackoverflow.com/a/43147122/1347281) earlier. – Vidya Mar 31 '17 at 18:12

If you want to use Option you have to use it for all records. You should also use None instead of Option(null):

Seq((1, Some(3)), (3, None)).toDF("a", "b").as[optionTest]
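
For what it's worth, Option(null) does evaluate to None at runtime; the problem is its static type, Option[Null], which is what drags the inferred element type up to Any in the first place. A quick REPL check (nothing Spark-specific):

Option(null) == None    // true at runtime
val x = Option(null)    // but statically typed as Option[Null]
val row = (3, None)     // None.type unifies with Some(3) to give Option[Int], so the Seq above encodes cleanly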