
Peers,

We need to standardize the processing of almost-SQL workloads on Spark 2.1. We are presently debating three options: RDDs, DataFrames, and Spark SQL. After a day's combing through Stack Overflow, papers, and the web, I have drawn up the comparison below. I am seeking feedback on the table, especially on performance and memory. Thanks in advance.

+---------------------------+------------------+-----------------+--------------------------------------+
|          Feature          |       RDD        | DataFrame (DF)  |              Spark SQL               |
+---------------------------+------------------+-----------------+--------------------------------------+
| First-class Spark citizen | Yes              | Yes             | Yes                                  |
| Native? [4]               | Core abstraction | API             | Module                               |
| Generation [5]            | 1st              | 2nd             | 3rd                                  |
| Abstraction [4,5]         | Low-level API    | Data processing | SQL-based                            |
| ANSI SQL standard         | None             | Some            | Near-ANSI [5]                        |
| Optimization              | None             | Catalyst [9]    | Catalyst [9]                         |
| Performance [3,4,8]       | Mixed views      | Mixed views     | Mixed views                          |
| Memory                    | ?                | ?               | ?                                    |
| Programming speed         | Slow             | Fast            | Faster for SQL workloads             |
+---------------------------+------------------+-----------------+--------------------------------------+
[3] Introducing DataFrames in Apache Spark for Large Scale Data Science, by Databricks
[4] Spark RDDs vs DataFrames vs SparkSQL, by Hortonworks
[5] A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets (When to Use Them and Why), by Databricks
[6] Introducing Apache Spark 2.0, by Databricks
[7] Spark RDD vs Spark SQL Performance Comparison Using Spark Java APIs
[8] Spark SQL queries vs DataFrame functions, on Stack Overflow
[9] Spark SQL: Relational Data Processing in Spark, by Databricks, MIT, and UC Berkeley
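For concreteness, here is a minimal sketch of the same per-group average written against each of the three APIs (Scala, Spark 2.x style). The dataset, column names, and app name are hypothetical; this only illustrates the shape of each API, not a benchmark:

    import org.apache.spark.sql.SparkSession

    object ThreeApis {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("three-apis").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical data: (department, salary) pairs.
        val data = Seq(("eng", 100), ("eng", 200), ("sales", 150))

        // 1) RDD: low-level functional operators, no Catalyst optimization.
        val rddAvg = spark.sparkContext.parallelize(data)
          .mapValues(s => (s.toDouble, 1))
          .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
          .mapValues { case (sum, count) => sum / count }

        // 2) DataFrame API: declarative operators, planned by Catalyst.
        val df = data.toDF("dept", "salary")
        val dfAvg = df.groupBy("dept").avg("salary")

        // 3) Spark SQL: the same Catalyst plan, written as SQL text.
        df.createOrReplaceTempView("employees")
        val sqlAvg = spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

        rddAvg.collect().foreach(println)
        dfAvg.show()
        sqlAvg.show()
        spark.stop()
      }
    }

The DataFrame and SQL variants should produce identical physical plans (visible via .explain()), while the RDD variant bypasses the optimizer entirely, which is the crux of the performance row in the table.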

EDIT to explain how question is different and not a duplicate:

Thanks for the reference to the sister question. While I see a detailed discussion and some overlap there, I see minimal (if any):
(a) discussion of Spark SQL,
(b) comparison of the memory consumption of the three approaches, and
(c) performance comparison on Spark 2.x (now updated in my question). It cites [4] (useful), but that is based on Spark 1.6.

I argue that my revised question is still unanswered, and I request that it be unflagged as a duplicate.

Dr.Rizz

1 Answer


My personal opinion:

  • In terms of performance, you should use DataFrames/Datasets or Spark SQL. RDDs are not optimized by the Catalyst optimizer or the Tungsten project.
  • In terms of flexibility, I think the DataFrame API gives you more readability and is much more dynamic than SQL, especially in Scala or Python, although you can mix the two if you prefer (see the sketch after this list).
  • I would use SQL only if you want to migrate Hive workloads or if you connect to the Spark Thrift Server via ODBC from BI tools.
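On the mixing point in the second bullet, here is a minimal sketch (hypothetical data and names) of starting a query in SQL and refining the result with DataFrame operators; both halves feed a single Catalyst plan, so there is no performance penalty for switching styles:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object MixedApis {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("mixed-apis").master("local[*]").getOrCreate()
        import spark.implicits._

        // Hypothetical table of (dept, salary) rows.
        val df = Seq(("eng", 100), ("eng", 200), ("sales", 150)).toDF("dept", "salary")
        df.createOrReplaceTempView("employees")

        // Begin with SQL text, then continue with DataFrame operators on the result.
        val result = spark.sql("SELECT dept, salary FROM employees WHERE salary > 50")
          .filter(col("dept") === "eng")
          .groupBy("dept")
          .count()

        result.show()
        spark.stop()
      }
    }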
gasparms
    Readability is subjective; I find SQL to be better understood by a broader user base than any API. – am5 Aug 28 '19 at 19:22