I am trying to write a generic method that creates a Dataset, with the client supplying the data file name, the file format, and 'something' that can represent the input case class for the schema. I tried this:
def dataSetFromFileAndCaseClass[T](spark: SparkSession, fileName: String, schema: ClassTag[T], fileFormat: String) = {
  import spark.implicits._
  fileFormat match {
    case "csv"  => spark.read.csv(fileName).as[schema]
    case "json" => spark.read.json(fileName).as[schema]
    case _      => throw new Exception("File format not supported")
  }
}
...and it doesn't work as I expected :). It doesn't even compile, since 'schema' is a value, not a type I can hand to 'as'.
'as' is defined like this:
def as[U : Encoder]: Dataset[U] = Dataset[U](sparkSession, logicalPlan)
So as I understand it, 'as' expects an implicit Encoder for whatever case class the client wants to use.
So the 'schema' parameter must somehow bring an Encoder into scope for whatever case class the client calls dataSetFromFileAndCaseClass with.
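For a concrete case class this works fine, because spark.implicits._ supplies the Encoder. For example (Person and people.json are just placeholders of mine):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
// Encoder[Person] is found implicitly via spark.implicits._
val people = spark.read.json("people.json").as[Person]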
How do I modify the 'dataSetFromFileAndCaseClass' signature to get this working?
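My best guess is that the signature needs an Encoder context bound (or an implicit Encoder[T] parameter) instead of the ClassTag, roughly like the sketch below, but I'm not sure this is correct or idiomatic:

import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Guess: ask the caller for an implicit Encoder[T] instead of a ClassTag,
// so that .as[T] can find the encoder it needs.
def dataSetFromFileAndCaseClass[T : Encoder](spark: SparkSession,
                                             fileName: String,
                                             fileFormat: String): Dataset[T] = {
  fileFormat match {
    case "csv"  => spark.read.csv(fileName).as[T]
    case "json" => spark.read.json(fileName).as[T]
    case _      => throw new Exception("File format not supported")
  }
}

// The caller would then do something like:
//   import spark.implicits._
//   val ds = dataSetFromFileAndCaseClass[Person](spark, "people.json", "json")

Is that the right way to do it, or is there a better approach?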