Background

It is easy to read about all of the benefits of Datasets in Spark, including compile-time type checking, performance optimizations, and so on.

When you start coding with Datasets, though, you quickly find that many operations return a DataFrame instead of a Dataset. For example, `dataSet.select("a", "b")` will return a DataFrame (at least in many cases it will).
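
For concreteness, here is a minimal sketch of the behavior I mean (the `Record` case class, the column names, and the local `SparkSession` setup are just placeholders I made up for this example):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case class standing in for whatever schema you have:
case class Record(a: Int, b: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val ds: Dataset[Record] = Seq(Record(1, "x"), Record(2, "y")).toDS()

// select with column-name strings is an untyped operation, so the result
// is a DataFrame (an alias for Dataset[Row]), not a Dataset[Record]:
val df: DataFrame = ds.select("a", "b")
```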

My Question

You can easily change a DataFrame back into a Dataset after the select by using a case class and `.as[CaseClassName]`. Doing this after every transformation seems a little painful, though.
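
Continuing the sketch above, the re-typing step looks like this (again assuming the same made-up `Record` class, whose fields have to line up with the selected columns):

```scala
// .as supplies an encoder that re-types the untyped result; it only
// succeeds if the column names and types match Record's fields:
val typedAgain: Dataset[Record] = ds
  .select("a", "b")
  .as[Record]
```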

Is converting with `.as` after every transformation the proper way to stick to using Datasets? Or am I approaching this wrong in general?

John Humphreys
  • For the record - there are no _performance optimizations_ - a `Dataset` of anything other than `Row` is typically much slower than `Dataset[Row]`. _Is converting with "as" after every transformation the proper way to stick to using Datasets?_ - no. The "correct" way is not using `DataFrame` methods (see the sketch after these comments). You either get the dynamism, performance, and expressiveness of `DataFrame`, or a somewhat compile-safe `Dataset`. You can't have your cake and eat it. Every time you call `as`, things are no longer even remotely safe. – Alper t. Turker Apr 08 '18 at 23:58
  • @user9613318 Can I have a source on that? Dynamic typing is generally relatively slow. Knowing explicit types lets you store things far more optimally and do far more efficient calculations in general. I also thought Project Tungsten etc. were specifically coded to leverage the extra type data for performance optimizations. – John Humphreys Apr 09 '18 at 00:50
  • Start with [Spark 2.0 Dataset vs DataFrame](https://stackoverflow.com/q/40596638/9613318), and also the Week 4 Datasets lecture in [Big Data Analysis with Scala and Spark](https://stackoverflow.com/q/40596638/9613318). Dynamism in DataFrame has very little to do with dynamic typing. What you really have there is a small compiler for a SQL-like language. Its performance comes from restricted semantics. An arbitrary `Dataset` is more expressive and therefore doesn't allow the same range of optimizations. – Alper t. Turker Apr 09 '18 at 10:36
  • Cool, will check out the links. Thanks for the follow-up :). By the way, I had the question because the textbook "Mastering Apache Spark 2.x" seems to imply a performance benefit, but I couldn't specifically determine what it was. So maybe it's misleading. – John Humphreys Apr 09 '18 at 13:04
  • The only performance benefit I am aware of is `Encoder` behavior. The generic `RowEncoder` is pretty slow compared to `ProductEncoders`. – Alper t. Turker Apr 09 '18 at 14:19
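
Continuing the earlier sketch, the typed alternative the comments describe would look something like this (still using the made-up `Record` class; the typed `select` overload taking `TypedColumn`s is one way to avoid the round trip through `DataFrame`):

```scala
// Typed transformations such as filter and map never leave the typed API,
// so the result stays a Dataset[Record]:
val filtered: Dataset[Record] = ds.filter(_.a > 1)

// select can also stay typed when given TypedColumns; it then returns a
// Dataset of the selected field types rather than a DataFrame:
val pairs: Dataset[(Int, String)] = ds.select($"a".as[Int], $"b".as[String])
```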

0 Answers