
Given that Avro and Parquet files contain both the data and the schema for that data, it should be possible in Spark to read these files as a `Dataset` rather than a `DataFrame`. But every source I have seen reads them as a `DataFrame`, and I can't find any way of reading them as a `Dataset`.

Does anyone know how to read these files as Datasets?

user1888243
  • `DataFrame` is a `Dataset` (`Dataset[Row]` to be precise). – Alper t. Turker May 20 '18 at 19:48
  • True, but a real Dataset knows what the type of each column is, whereas a DataFrame has no idea what the type of each field is; it only knows that it holds a collection of Rows. – user1888243 May 20 '18 at 22:39
  • That's not true. In both cases `Dataset` "knows" exactly the same amount of information about the column types. That's what specifically allows for Catalyst optimizations. The difference is how much information it can provide to the compiler. – Alper t. Turker May 21 '18 at 04:52
  • No, it is not. You should go back and read the fundamentals. There is a reason that a distinction is made, and the reason is that with DataFrame each row is looked at just as a Row, and not as the fields that the Row is made of. With a pure Dataset, Spark can perform better optimization. – user1888243 May 21 '18 at 14:24
  • With "pure Dataset" (meaning outside `DataFrame` / SQL API) Spark can apply [fewer optimizations](https://stackoverflow.com/q/40596638/9613318). "Binary" `Encoders` are the only performance improvement (compared to `RowEncoder`). Also, representation doesn't change. Internal storage stays exactly the same, if you have `ds: Dataset[Row]` and child `ds.as[T]: Dataset[T]`. – Alper t. Turker May 21 '18 at 14:32
  • And more [Big Data Analysis with Scala and Spark - Datasets](https://www.coursera.org/learn/scala-spark-big-data/lecture/yrfPh/datasets) – Alper t. Turker May 21 '18 at 14:36
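
To make the distinction debated in these comments concrete, here is a minimal sketch (the `Person` case class is purely illustrative) of how a `DataFrame` is an alias for `Dataset[Row]` and how `.as[T]` gives the compiler a typed view over the same data:

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// A DataFrame is just Dataset[Row]: Catalyst knows the column types,
// but the compiler only sees untyped Row values.
val df: DataFrame = Seq(Person("Ann", 30), Person("Bob", 25)).toDF()
val untyped: Dataset[Row] = df

// .as[T] attaches an Encoder, so the compiler now sees Person values;
// the internal representation stays the same.
val typed: Dataset[Person] = df.as[Person]
typed.filter(_.age > 28).show()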

1 Answer

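You can read the Parquet file with a schema derived from a case class `Encoder` and then call `.as[T]` to get a typed `Dataset`, for example: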
def readParquet(spark: SparkSession): Unit = {
  import org.apache.spark.sql._
  import spark.implicits._
  import Test._

  // Optional: disable the vectorized reader only if you run into Parquet decoding errors
  spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

  // Derive the schema from the case class so the reader does not have to infer it
  val schema = Encoders.product[TestData].schema
  val ds =
    spark.read
      .schema(schema)
      .parquet("data.parquet")
      .as[TestData] // convert the DataFrame (Dataset[Row]) into a typed Dataset[TestData]

  ds.show(false)
}

object Test {
  case class TestData(id: Int, name: String, usedAmount: Double)
}
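
The same pattern works for Avro. A minimal sketch, assuming Spark 2.4+ with the spark-avro module (org.apache.spark:spark-avro) on the classpath and a hypothetical data.avro path:

def readAvro(spark: SparkSession): Unit = {
  import org.apache.spark.sql._
  import spark.implicits._
  import Test._

  // Derive the schema from the same case class and read the Avro file as a typed Dataset
  val schema = Encoders.product[TestData].schema
  val ds =
    spark.read
      .schema(schema)
      .format("avro") // use "com.databricks.spark.avro" on Spark versions before 2.4
      .load("data.avro")
      .as[TestData]

  ds.show(false)
}

In both cases the `.as[TestData]` call only attaches an `Encoder`; the data itself is not copied, so the conversion is cheap.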
Tawkir