
Leaving aside the database connection aspects usually discussed alongside mapPartitions for RDDs, and noting that, for me, what a DataFrame does under the hood is harder to follow than the RDD abstraction:

  • Is DataFrame performance now good enough that we would never need to convert a DataFrame to an RDD just to use mapPartitions for processing performance?
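For concreteness, here is a minimal sketch of the two patterns I am comparing; the data and the doubling operation are just placeholders:

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-vs-rdd").master("local[*]").getOrCreate()

    val df = spark.range(1000000).toDF("id")

    // Pattern 1: drop to the RDD API so that mapPartitions can be used.
    val viaRdd = df.rdd.mapPartitions { rows =>
      rows.map(r => r.getLong(0) * 2)
    }
    println(viaRdd.count())

    // Pattern 2: stay in the DataFrame API and let the engine optimize.
    val viaDf = df.selectExpr("id * 2 AS doubled")
    viaDf.show(5)

    spark.stop()
  }
}
```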
thebluephantom
  • I didn't understand the first part. The answer you are looking for to the second question is partially available in this link: https://stackoverflow.com/a/29012187/3213772 – pushpavanthar Jul 02 '18 at 08:59
  • You mean bullet 1? If so: mapPartitions is seen as a performance booster for RDDs. If DFs are so good, how does the under-the-hood machinery equate to better performance than an RDD using mapPartitions? – thebluephantom Jul 02 '18 at 09:02
  • @puru But that link is about RDDs; I get that. I am wondering what it means when a DataFrame is loaded. With all this under-the-hood optimization, it is not clear how default partitioning applies to a DF. – thebluephantom Jul 02 '18 at 09:51
  • I have edited the question and left this out – thebluephantom Jul 02 '18 at 09:57

1 Answer


From Spark 2.0 onwards, a DataFrame is a Dataset organized into named columns. To answer your question: there is no need to convert DataFrames back to RDDs to achieve performance and optimization, because Datasets and DataFrames are themselves very efficient compared to raw RDDs, for the reasons below.

  1. They are built on top of the Spark SQL engine, which uses the Catalyst optimizer. Catalyst leverages advanced programming language features (e.g. Scala’s pattern matching and quasiquotes) to generate an optimized logical and physical query plan. Whereas the typed Dataset[T] API is optimized for data engineering tasks, the untyped Dataset[Row] (an alias for DataFrame) is even faster and suitable for interactive analysis. (See the first sketch below.)
  2. Spark, acting like a compiler, understands the JVM type of your Dataset: it maps your type-specific JVM objects to Tungsten’s internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize/deserialize JVM objects and generate compact bytecode that executes at superior speed. (See the second sketch below.)
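To make point 1 concrete, here is a minimal sketch (the query and names are illustrative only). explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan, and in the optimized plan you can see Catalyst collapse the two separately written filters into a single predicate:

```scala
import org.apache.spark.sql.SparkSession

object CatalystPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("catalyst-sketch").master("local[*]").getOrCreate()

    val df = spark.range(100000).toDF("id")

    // Two filters written separately; Catalyst merges them into a
    // single predicate in the optimized logical plan.
    val query = df.filter("id > 10").filter("id < 100").select("id")

    // Prints parsed, analyzed, optimized logical plans and the physical plan.
    query.explain(true)

    spark.stop()
  }
}
```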
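And for point 2, a small sketch of Encoders at work (the Sale case class is just an example I made up):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// A type-specific JVM object; its Encoder is derived automatically and
// maps instances to Tungsten's compact binary row format.
case class Sale(item: String, amount: Double)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // toDS() picks up the implicit product Encoder for Sale.
    val ds = Seq(Sale("a", 10.0), Sale("b", 20.5)).toDS()

    // The Encoder carries a schema derived from the case class fields.
    println(Encoders.product[Sale].schema.treeString)

    // The typed lambda uses the Encoder's generated code to turn rows
    // back into Sale objects, rather than generic Java serialization.
    ds.filter(_.amount > 15.0).show()

    spark.stop()
  }
}
```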
pushpavanthar
  • I have read those things but find them a bit thin. I am wondering if they are true, as I have seen some posts suggesting otherwise. That said, there are so many variables in a multi-use system that, for the moment, I am going to assume this is true. Not everything is relational. – thebluephantom Jul 03 '18 at 08:33