
I'm using Spark Datasets to read in CSV files. I wanted to make a polymorphic function to do this for a number of files. Here's the function:

def loadFile[M](file: String):Dataset[M] = {
    import spark.implicits._
    val schema = Encoders.product[M].schema
    spark.read
      .option("header","false")
      .schema(schema)
      .csv(file)
      .as[M]
}

The errors that I get are:

[error] <myfile>.scala:45: type arguments [M] do not conform to method product's type parameter bounds [T <: Product]
[error]     val schema = Encoders.product[M].schema
[error]                                  ^
[error] <myfile>.scala:50: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
[error]       .as[M]
[error]          ^
[error] two errors found

I don't know what to do about the first error. I tried adding the same upper bound as the product definition (M <: Product), but then I get the error "No TypeTag available for M".
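For context, Encoders.product is declared in Spark's Encoders object with both a Product upper bound and a TypeTag context bound, which is where this second error comes from. A minimal sketch of why the bound alone is not enough (schemaFor is a hypothetical helper used only for illustration):

import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.Encoders

// Encoders.product is declared as
//   def product[T <: Product : TypeTag]: Encoder[T]
// so bounding M by Product alone leaves the implicit TypeTag[M]
// unsatisfied; hence "No TypeTag available for M". Adding the
// TypeTag context bound makes this line compile:
def schemaFor[M <: Product : TypeTag] = Encoders.product[M].schema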

If I instead pass in the schema already produced by the encoder, I then get the error:

[error] Unable to find encoder for type stored in a Dataset 

1 Answer


You need to require anyone calling loadFile[M] to provide evidence that an encoder exists for M. You can do this with a context bound on M, which requires an implicit Encoder[M] to be in scope at the call site:

def loadFile[M : Encoder](file: String): Dataset[M] = {
  import spark.implicits._
  val schema = implicitly[Encoder[M]].schema
  spark.read
    .option("header", "false")
    .schema(schema)
    .csv(file)
    .as[M]
}
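For example, with a hypothetical case class whose fields match the CSV's columns (Person and its fields here are assumptions for illustration, not from the question), the call site looks like:

import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type matching a three-column CSV file.
case class Person(name: String, age: Int, city: String)

val spark = SparkSession.builder()
  .appName("loadFile-example")
  .master("local[*]")
  .getOrCreate()

// spark.implicits._ derives an Encoder[Person] for the case class,
// which satisfies loadFile's context bound at the call site.
import spark.implicits._

val people: Dataset[Person] = loadFile[Person]("people.csv")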
Yuval Itzchakov
  • Thanks! That definitely compiled, but I had some access problems and an out-of-memory problem running my program, even when I don't call the function. I assume I can make my case class extend Encoder and it would work if I didn't have these other runtime problems? – kim Jul 26 '17 at 13:06
  • @kim This is a compile time requirement, this shouldn't affect the runtime at all. Perhaps something else is causing your code to OOM. – Yuval Itzchakov Jul 26 '17 at 13:08
  • I decided to get around the whole Encoder problem by not using Spark, but I did find this issue, which talks about [encoders for custom objects](https://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-dataset). I'll come back to figuring it out when I have some time. I'll mark this as my answer though since it got me on the right track. – kim Jul 28 '17 at 13:33
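For reference, the approach discussed in that link amounts to supplying an explicit Encoder for types that spark.implicits._ cannot derive, for example via Kryo. A minimal sketch (CustomRecord is an illustrative class, not from the question):

import org.apache.spark.sql.{Encoder, Encoders}

// A plain class (not a case class), so spark.implicits._ cannot
// derive an Encoder for it automatically.
class CustomRecord(val id: Long, val payload: String) extends Serializable

// Fall back to Kryo-based binary serialization. Note that a Kryo
// encoder's schema is a single binary "value" column, so it satisfies
// the Encoder context bound but is not a drop-in for column-based
// CSV reads like the one in loadFile.
implicit val customRecordEncoder: Encoder[CustomRecord] =
  Encoders.kryo[CustomRecord]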