When importing data from an MS SQL Server database, some columns can contain null values. Spark DataFrames handle these nulls without issue, but when I try to convert the DataFrame to a strongly typed Dataset, I get encoder errors.
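For context, the data is loaded roughly like this (a minimal sketch; the url, table name, and credentials are placeholders, and it assumes a SparkSession named spark is already in scope):

// Connection details below are placeholders for the real SQL Server instance.
val raw = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
  .option("dbtable", "dbo.some_table")
  .option("user", "username")
  .option("password", "password")
  .load()

raw.show()  // null values display fine at the DataFrame level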
Here's a simplified example that reproduces the error:
import org.apache.spark.sql.SparkSession

case class OptionTest(a: Option[Int], b: Option[Int])

object TestObject {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // (1, 3) infers as (Int, Int) while (3, Option(null)) infers as
    // (Int, Option[Null]), so the Seq as a whole is Seq[(Int, Any)].
    val df2 = Seq((1, 3), (3, Option(null)))
      .toDF("a", "b")
      .as[OptionTest]

    df2.show()
  }
}
Here is the error for this case:
java.lang.UnsupportedOperationException: No Encoder found for Any
- field (class: "java.lang.Object", name: "_2")
- root class: "scala.Tuple2"
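For what it's worth, if I wrap every value in Option by hand, the conversion works; a minimal sketch, using the same OptionTest case class, with None standing in for the SQL NULL:

// With every value wrapped, the Seq infers as
// Seq[(Option[Int], Option[Int])] and Spark can derive an encoder.
val df3 = Seq((Option(1), Option(3)), (Option(3), None))
  .toDF("a", "b")
  .as[OptionTest]

df3.show()

Wrapping each field manually obviously doesn't scale to a DataFrame read straight from SQL Server, which brings me to the question: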
What is the recommended approach for handling nullable values when creating a Dataset from a DataFrame?