I have read:
- What is the difference between Spark DataSet and RDD
- Difference between DataSet API and DataFrame
- http://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes
- https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
In Spark 1.6, a Dataset seems to be more like an improved DataFrame ("Conceptually Spark DataSet is just a DataFrame with additional type safety"). In Spark 2.0 it looks a lot more like an improved RDD: the former has a relational model, the latter is more like a list. In Spark 1.6, Datasets were described as an extension of DataFrames, while in Spark 2.0 a DataFrame is just a `Dataset[Row]`, making DataFrames a special case of Datasets. Now I am a little confused. Are Datasets in Spark 2.0 conceptually more like RDDs or like DataFrames? What is the conceptual difference between an RDD and a Dataset in Spark 2.0?
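To make concrete what I mean by "DataFrames are just Datasets of `Row`", here is a sketch of the three APIs side by side (the `people.json` file name is just a placeholder for illustration):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// In Spark 2.0, DataFrame is a type alias defined in the
// org.apache.spark.sql package object:
//   type DataFrame = Dataset[Row]

case class Person(name: String, age: Long)

val spark = SparkSession.builder().appName("example").getOrCreate()
import spark.implicits._  // brings Encoders for case classes into scope

val df: DataFrame = spark.read.json("people.json")  // untyped: Dataset[Row]
val ds: Dataset[Person] = df.as[Person]             // typed: Dataset[Person]
val rdd: RDD[Person] = ds.rdd                       // drop down to the RDD API
```

So syntactically a Dataset can be converted both ways, which is exactly why I cannot tell whether it is "conceptually" closer to the RDD side or the DataFrame side.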