Peers,
We need to standardize processing of almost-SQL workloads on Spark 2.1. We are presently debating three options: RDDs, DataFrames, and Spark SQL. After a day of combing through Stack Overflow, papers, and the web, I drew up the comparison below. I am seeking feedback on the table, and especially on the performance and memory rows. Thanks in advance.
+---------------------------+------------------+-----------------+--------------------------------------+
| Feature                   | RDD              | DataFrame (DF)  | Spark SQL                            |
+---------------------------+------------------+-----------------+--------------------------------------+
| First-class Spark citizen | Yes              | Yes             | Yes                                  |
| Native? [4]               | Core abstraction | API             | Module                               |
| Generation [5]            | 1st              | 2nd             | 3rd                                  |
| Abstraction [4,5]         | Low-level API    | Data processing | SQL-based                            |
| ANSI standard SQL         | None             | Some            | Near-ANSI [5]                        |
| Optimization              | None             | Catalyst [9]    | Catalyst [9]                         |
| Performance [3,4,8]       | Mixed views      | Mixed views     | Mixed views                          |
| Memory                    | ?                | ?               | ?                                    |
| Programming speed         | Slow             | Fast            | Faster for SQL workloads             |
+---------------------------+------------------+-----------------+--------------------------------------+
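To make the comparison concrete, here is the same toy aggregation expressed in each of the three APIs. This is a minimal Scala sketch against Spark 2.1; the Sale case class, column names, and sample data are invented purely for illustration:

    import org.apache.spark.sql.SparkSession

    // Hypothetical record type, invented for this example.
    case class Sale(region: String, amount: Double)

    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("EU", 10.0), Sale("US", 20.0), Sale("EU", 5.0))

    // 1) RDD: low-level; the aggregation is spelled out by hand and
    //    never sees the Catalyst optimizer.
    val rddTotals = spark.sparkContext
      .parallelize(sales)
      .map(s => (s.region, s.amount))
      .reduceByKey(_ + _)

    // 2) DataFrame: declarative operators, planned by Catalyst.
    val df = sales.toDF()
    val dfTotals = df.groupBy("region").sum("amount")

    // 3) Spark SQL: the same logical plan, written as near-ANSI SQL.
    df.createOrReplaceTempView("sales")
    val sqlTotals = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    rddTotals.collect().foreach(println)
    dfTotals.show()
    sqlTotals.show()

My understanding from [8] and [9] is that the DataFrame and SQL versions compile to the same Catalyst plan, which is why they are usually reported as performing identically, while the RDD version bypasses the optimizer entirely.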
[3] Introducing DataFrames in Apache Spark for Large Scale Data Science by Databricks
[4] Spark RDDs vs DataFrames vs SparkSQL by Hortonworks
[5] A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets (When to Use Them and Why) by Databricks
[6] Introducing Apache Spark 2.0 by Databricks
[7] Spark RDD vs Spark SQL Performance Comparison Using Spark Java APIs
[8] Spark SQL queries vs DataFrame functions on Stack Overflow
[9] Spark SQL: Relational Data Processing in Spark by Databricks, MIT, and UC Berkeley
EDIT (to explain how this question is different and not a duplicate):
Thanks for the reference to the sister question. While I see a detailed discussion and some overlap there, I see minimal (or no):
(a) discussion of Spark SQL,
(b) comparison of the memory consumption of the three approaches, and
(c) performance comparison on Spark 2.x (updated in my question; see the sketch after this list). It cites [4] (useful), but [4] is based on Spark 1.6.
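To be explicit about the kind of performance evidence I am missing, a crude measurement would look roughly like the following. This rough Scala sketch reuses the rddTotals, dfTotals, and sqlTotals definitions from the example above; a real benchmark would also need warm-up runs, repeated trials, and realistic data volumes:

    // Naive wall-clock timing; illustration only, not a rigorous benchmark.
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label%-10s ${(System.nanoTime() - start) / 1e6}%.1f ms")
      result
    }

    time("RDD")       { rddTotals.collect() }
    time("DataFrame") { dfTotals.collect() }
    time("SQL")       { sqlTotals.collect() }

For the memory row, one comparison I can think of is caching the same data as an RDD and as a DataFrame and comparing the footprints in the Spark UI's Storage tab; my expectation, based on [5], is that the DataFrame's Tungsten binary encoding caches more compactly than Java-object RDDs, but I have not measured it.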
I argue that my revised question is still unanswered, and I request that the duplicate flag be removed.