
I read:

In Spark 1.6 a Dataset seems to be more like an improved DataFrame ("Conceptually Spark DataSet is just a DataFrame with additional type safety"). In Spark 2.0 it seems a lot more like an improved RDD: the former has a relational model, the latter is more like a list. For Spark 1.6 it was said that Datasets are an extension of DataFrames, while in Spark 2.0 a DataFrame is just a Dataset of type Row (Dataset[Row]), making DataFrames a special case of Datasets.

Now I am a little confused. Are Datasets in Spark 2.0 conceptually more like RDDs or like DataFrames? What is the conceptual difference between an RDD and a Dataset in Spark 2.0?
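(For what it's worth, the relationship is literal in the Spark 2.0 API: org.apache.spark.sql defines DataFrame as a type alias for Dataset[Row]. A minimal sketch; the JSON path is only a placeholder:)

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spark 2.0 declares `type DataFrame = Dataset[Row]`, so the result of a
// DataFrame-returning call can be assigned to a Dataset[Row] unchanged.
val df: Dataset[Row] = spark.read.json("people.json") // placeholder path
```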


1 Answer


I think they are very similar from a user's perspective, but they are implemented quite differently under the hood. The Dataset API now seems almost as flexible as the RDD API, but it adds the whole optimization story (the Catalyst optimizer and the Tungsten execution engine).
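To make the "almost as flexible" point concrete, here is the same computation written against both APIs (a minimal sketch; the object and app names are just illustrative):

```scala
import org.apache.spark.sql.SparkSession

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-comparison").master("local[*]").getOrCreate()
    import spark.implicits._ // encoders for primitives and case classes

    // RDD version: operates on plain JVM objects; no query optimizer involved.
    val rddSum = spark.sparkContext
      .parallelize(Seq(1, 2, 3, 4))
      .map(_ * 2)
      .reduce(_ + _)

    // Dataset version: the same functional style, but the data is stored in
    // Tungsten's binary row format and the job is planned through Catalyst.
    val dsSum = Seq(1, 2, 3, 4).toDS()
      .map(_ * 2)
      .reduce(_ + _)

    println(s"RDD: $rddSum, Dataset: $dsSum") // both print 20
    spark.stop()
  }
}
```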

Quoting from http://www.agildata.com/apache-spark-2-0-api-improvements-rdd-dataframe-dataset-sql/:

RDDs can be used with any Java or Scala class and operate by manipulating those objects directly with all of the associated costs of object creation, serialization and garbage collection.

Datasets are limited to classes that implement the Scala Product trait, such as case classes. There is a very good reason for this limitation. Datasets store data in an optimized binary format, often in off-heap memory, to avoid the costs of deserialization and garbage collection. Even though it feels like you are coding against regular objects, Spark is really generating its own optimized byte-code for accessing the data directly.
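To illustrate what that limitation looks like in practice, here is a small sketch using a case class (case classes extend scala.Product, so Spark can derive an Encoder for them); the Person and EncoderDemo names are just illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Case classes extend scala.Product, so Spark can derive an Encoder that
// maps Person instances to Tungsten's compact binary row format.
case class Person(name: String, age: Int)

object EncoderDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 31), Person("Bob", 17)).toDS()

    // A Column expression like this is resolved by Catalyst and evaluated
    // against the binary format directly, without materializing Person objects.
    val adults = people.filter($"age" >= 18)
    adults.show()

    spark.stop()
  }
}
```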
