What is the advantage of using a case class with a Spark DataFrame? I can define the schema using the "inferSchema" option or by defining StructType fields. I referred to "https://docs.scala-lang.org/tour/case-classes.html" but could not understand what the advantages of using a case class are, apart from generating the schema using reflection.

Shubhanshu
    See [Spark 2.0 Dataset vs DataFrame](https://stackoverflow.com/questions/40596638/spark-2-0-dataset-vs-dataframe) and [Difference between DataSet API and DataFrame API](https://stackoverflow.com/q/37301226/6910411) – zero323 Oct 25 '18 at 11:10

1 Answer

inferSchema can be an expensive operation, because it typically needs an extra pass over the data, and it defers errors unnecessarily. Consider the following pseudocode:

val df = loadDFWithSchemaInference
// do things that take time
df.map(row => row.getAs[String]("fieldName")) // more stuff

In this code the assumption that fieldName is of type String is already baked in, but it is only expressed and enforced late in your processing, which leads to unfortunate errors if it wasn't actually a String.

Now, if you do this instead:

val df = load.as[CaseClass]

or

val df = spark.read.schema(predefinedSchema).load(path)

then the fact that fieldName is a String becomes a precondition, and your code will be more robust and less error-prone.
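As a minimal sketch of both variants, assuming a local SparkSession: the `Record` case class, its `count` field, and the in-memory JSON lines are made up for illustration and stand in for your real type and input path. `DataFrameReader.schema` fixes the column types up front, and `.as[Record]` gives the typed view.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical record type; only fieldName comes from the pseudocode above.
case class Record(fieldName: String, count: Long)

object ExplicitSchemaSketch {
  // Hand-written schema, the StructType counterpart of Record.
  val predefinedSchema: StructType = StructType(Seq(
    StructField("fieldName", StringType),
    StructField("count", LongType)
  ))

  def load(spark: SparkSession): Dataset[Record] = {
    import spark.implicits._
    // In-memory JSON lines stand in for a real input path (assumption).
    val raw = Seq(
      """{"fieldName":"a","count":1}""",
      """{"fieldName":"b","count":2}"""
    ).toDS()
    spark.read
      .schema(predefinedSchema) // types are declared, no inference pass
      .json(raw)
      .as[Record]               // typed view backed by the case class
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-schema-sketch")
      .getOrCreate()
    load(spark).show()
    spark.stop()
  }
}
```

With a real source you would pass a path to `json`, `csv`, or `load` instead of the in-memory Dataset; the schema handling stays the same.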

Schema inference is very handy for exploratory work in the REPL or, e.g., Zeppelin, but it should not be used in operational code.

Edit Addendum: I personally prefer case classes over schemas because I prefer the Dataset API to the DataFrame API (a DataFrame is just Dataset[Row]) for similar robustness reasons.
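To illustrate the robustness point, here is a small sketch comparing the two APIs on the same data. The `Event` case class and its values are hypothetical. With `Dataset[Event]` the field access is checked by the Scala compiler, while with a DataFrame the column name and type are only checked when the job runs.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type used only for this comparison.
case class Event(fieldName: String, count: Long)

object TypedVsUntypedSketch {
  // Returns the fieldName values read through the typed and the untyped API.
  def names(spark: SparkSession): (Seq[String], Seq[String]) = {
    import spark.implicits._
    val ds: Dataset[Event] = Seq(Event("a", 1L), Event("b", 2L)).toDS()
    val df: DataFrame = ds.toDF()

    // Dataset[Event]: the compiler knows fieldName is a String;
    // a typo such as _.fieldNam would not compile.
    val typed = ds.map(_.fieldName).collect().toSeq

    // DataFrame (Dataset[Row]): the name and the type are plain runtime
    // lookups, so a mistake here only surfaces when the job executes.
    val untyped = df.map(row => row.getAs[String]("fieldName")).collect().toSeq

    (typed, untyped)
  }
}
```

Both calls return the same values here; the difference is when a mistake would be caught.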

Dominic Egger
  • Thanks Dominic for the insight, so defining a schema of type StructType is also good for operational code. What if the schema contains hundreds of columns? Do I need to define that manually? – Shubhanshu Oct 25 '18 at 10:00
  • At some point probably yes, unless you can derive it from something that already exists. There's also the option of defining a case class with just the fields you actually need and dropping the rest on load. – Dominic Egger Oct 25 '18 at 10:27
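A rough sketch of the "only the fields you need" idea from the last comment, under a few assumptions: the `Slim` case class and the column names are invented, and the wide DataFrame is built in memory instead of being loaded. Note that `.as[Slim]` by itself only changes the typed view and does not project away extra columns, so the sketch selects the case class's columns first, taking their names from `Encoders.product[Slim].schema`.

```scala
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical narrow type: just the columns the job actually uses.
case class Slim(id: Long, name: String)

object NarrowOnLoadSketch {
  def narrow(spark: SparkSession): Dataset[Slim] = {
    import spark.implicits._
    // Stand-in for a wide source with many more columns (assumption).
    val wide = Seq((1L, "a", "x", 0.5), (2L, "b", "y", 1.5))
      .toDF("id", "name", "extra1", "extra2")

    // Column list derived from the case class, so it is not written twice.
    val wanted = Encoders.product[Slim].schema.fieldNames.map(col)

    // Drop everything else, then switch to the typed view.
    wide.select(wanted: _*).as[Slim]
  }
}
```

The same `Encoders.product[...].schema` call also yields a StructType that can be passed to `spark.read.schema(...)`, which avoids maintaining a case class and a hand-written schema side by side.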