I have some data stored as Parquet files and case classes matching the data schema. Spark deals well with regular Product types, so if I have
case class A(s: String, i: Int)
I can easily do
spark.read.parquet(file).as[A]
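For completeness, my setup looks roughly like this (the session config and the path are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._  // provides the derived Encoder for Product types like A

val file = "/path/to/data.parquet"  // placeholder path
val records = spark.read.parquet(file).as[A]  // Dataset[A]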
But from what I understand, Spark doesn't handle disjunction (sum) types, so when my Parquet data contains enums, previously encoded as integers, with a Scala representation like
sealed trait E
case object A extends E
case object B extends E
I cannot do
spark.read.parquet(file).as[E]
// java.lang.UnsupportedOperationException: No Encoder found for E
Makes sense so far, but then, probably too naively, I try
import scala.reflect.ClassTag
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

implicit val eEncoder = new org.apache.spark.sql.Encoder[E] {
  def clsTag = ClassTag(classOf[E])
  def schema = StructType(StructField("e", IntegerType, nullable = false) :: Nil)
}
And I still get the same "No Encoder found for E" :(
My question at this point is: why is the implicit not found in scope (or not recognized as an Encoder[E])? And even if it were, how would such an interface let me actually decode the data? I would still need to map the integer value to the proper case object.
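To make that concrete, the only workaround I can picture is to read the raw Int and map it by hand, falling back to a Kryo-based encoder for E (the column name "e" and the 0/1 mapping below are just assumptions for the sake of the example):

import org.apache.spark.sql.{Encoder, Encoders}
import spark.implicits._

def decode(i: Int): E = i match {
  case 0 => A
  case 1 => B
  case other => throw new IllegalArgumentException(s"unknown enum value: $other")
}

implicit val eKryoEncoder: Encoder[E] = Encoders.kryo[E]

val es = spark.read.parquet(file)
  .select($"e".as[Int])  // Dataset[Int]
  .map(decode)           // Dataset[E], stored as opaque Kryo-serialized bytes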
I did read a related answer that says "TL;DR There is no good solution right now, and given Spark SQL / Dataset implementation, it is unlikely there will be one in the foreseeable future." But I'm struggling to understand why a custom Encoder couldn't do the trick.