
Is there any industry guideline on whether to write a Spark project with RDDs or Datasets?

So far what's obvious to me:

  • RDD: more type safety, less optimization (in the Spark SQL sense)
  • Dataset: less type safety, more optimization
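To make the tradeoff concrete, here is a minimal sketch of the same filter-and-project written against both APIs (assuming a local SparkSession and a made-up Person case class). Only the Dataset version expresses the predicate as columns that Spark SQL's optimizer can see:

```scala
import org.apache.spark.sql.SparkSession

object RddVsDataset {
  // Hypothetical record type, for illustration only.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-dataset").master("local[*]").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35))

    // RDD version: typed end to end, but Spark only sees opaque lambdas,
    // so Catalyst/Tungsten cannot rewrite the filter or the projection.
    val rddAdults = spark.sparkContext
      .parallelize(people)
      .filter(_.age >= 30)
      .map(_.name)

    // Dataset version: still typed at the API surface, and the filter/select
    // are Column expressions that Catalyst can fold into the query plan.
    val dsAdults = people.toDS()
      .filter($"age" >= 30)
      .select($"name").as[String]

    rddAdults.collect().foreach(println)
    dsAdults.collect().foreach(println)

    spark.stop()
  }
}
```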

Which one is recommended for production code? There doesn't seem to be a topic on this on Stack Overflow so far, even though Spark has been prevalent for the past few years.

I can already foresee that the majority of the community sides with Dataset :), so let me first quote a case against it from this answer (and please do share opinions to the contrary):

Personally, I find statically typed Datasets to be the least useful: they don't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution-plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.

They are not as flexible as RDDs, with only a small subset of types supported natively.

"Type safety" with Encoders is disputable when Dataset is converted using as method. Because data shape is not encoded using a signature, a compiler can only verify the existence of an Encoder.

jack

2 Answers


Here is an excerpt from "Spark: The Definitive Guide" to answer this:

When to Use the Low-Level APIs?

You should generally use the lower-level APIs in three situations:

  • You need some functionality that you cannot find in the higher-level APIs; for example, if you need very tight control over physical data placement across the cluster.
  • You need to maintain some legacy codebase written using RDDs.
  • You need to do some custom shared variable manipulation.

https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch12.html

In other words: if you don't run into the situations above, it is generally better to use the higher-level APIs (Datasets/DataFrames).
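For concreteness, a small sketch of the kind of work that still calls for the low-level API: explicit control over partitioning (and therefore data placement) with a HashPartitioner, plus a shared accumulator updated inside a transformation. The names and numbers are illustrative only.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object LowLevelApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("low-level").setMaster("local[*]"))

    // Tight control over physical data placement: pick the partitioner
    // (and hence the co-location of keys) explicitly.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()

    // Custom shared variable manipulation: a long accumulator updated
    // from within a transformation.
    val badRecords = sc.longAccumulator("badRecords")
    val summed = partitioned.mapValues { v =>
      if (v < 0) badRecords.add(1)
      v
    }.reduceByKey(_ + _)

    summed.collect().foreach(println)
    println(s"bad records seen: ${badRecords.value}")

    sc.stop()
  }
}
```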

d-xa

RDD Limitations:

i. No built-in optimization engine:

There is no provision in RDD for automatic optimization. RDDs cannot take advantage of Spark's advanced optimizers, the Catalyst optimizer and the Tungsten execution engine; each RDD has to be optimized manually.

This limitation is overcome by Dataset and DataFrame: both use Catalyst to generate optimized logical and physical query plans. The same optimizer serves the R, Java, Scala, and Python DataFrame/Dataset APIs, and it provides both space and speed efficiency.
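As a rough illustration (column names made up), the DataFrame pipeline below can be inspected with explain(true) to see the plan Catalyst produces, while the equivalent RDD pipeline is just a chain of opaque lambdas with nothing for an optimizer to rewrite:

```scala
import org.apache.spark.sql.SparkSession

object CatalystSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // Both filters are Column expressions, so Catalyst can combine and
    // push them down; explain(true) prints the optimized plan.
    df.filter($"value" > 1)
      .filter($"key" =!= "c")
      .select($"key")
      .explain(true)

    // The equivalent RDD pipeline runs the lambdas exactly as written --
    // there is no query plan for an optimizer to improve.
    val rdd = spark.sparkContext
      .parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .filter(_._2 > 1)
      .filter(_._1 != "c")
      .map(_._1)
    println(rdd.toDebugString)

    spark.stop()
  }
}
```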

ii. No compile-time type safety for data workflows: With RDDs, errors in the shape of the data are not caught by the compiler and only show up at runtime. Dataset, by contrast, provides compile-time type safety for building complex data workflows: if you try to add an element of the wrong type, you get a compile-time error. This helps detect mistakes early and makes the code safer.
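A minimal sketch of that difference, assuming a made-up Sale case class: the typed Dataset catches a misspelled field at compile time, while the same typo in a DataFrame column name compiles and only fails once the query is analyzed at runtime.

```scala
import org.apache.spark.sql.SparkSession

object TypeSafetySketch {
  // Hypothetical record type, for illustration only.
  case class Sale(item: String, amount: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("book", 12.5), Sale("pen", 1.2)).toDS()

    // Dataset: field access is checked by the compiler.
    val big = sales.filter(_.amount > 10.0)       // OK
    // sales.filter(_.amont > 10.0)               // would not compile: typo caught at compile time

    // DataFrame: column names are plain strings, so the same typo compiles
    // and only fails at runtime with an analysis error.
    val df = sales.toDF()
    // df.filter($"amont" > 10.0).show()

    big.show()
    spark.stop()
  }
}
```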

iii. Degradation when there is not enough memory: RDD performance degrades when there is not enough memory to hold the RDD in memory. Partitions that overflow RAM can be spilled to disk, which keeps the job running but at reduced performance; increasing the available RAM and disk mitigates the issue.
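For illustration, the storage level chosen when persisting an RDD controls what happens to partitions that do not fit in RAM; the data below is arbitrary.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // MEMORY_ONLY (the default for RDD.cache): partitions that do not fit
    // in RAM are dropped and recomputed when they are needed again.
    val cached = numbers.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: overflowing partitions are spilled to disk instead
    // of being recomputed -- slower than RAM, but they are kept.
    val spilled = numbers.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)

    println(cached.count())
    println(spilled.count())

    spark.stop()
  }
}
```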

iv. Overhead of serialization and garbage collection: Because an RDD is a collection of in-memory JVM objects, it carries the overhead of garbage collection and Java serialization, which becomes expensive as the data grows. The cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects, or persisting the data in serialized form, lowers that cost.
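One common mitigation, sketched below with illustrative data: persist the RDD in serialized form (optionally with Kryo) so the heap holds byte arrays instead of many small objects, or cache the equivalent Dataset, which Spark SQL stores in its compact columnar format.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SerializedCacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      // Kryo is usually faster and more compact than Java serialization.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(1 to 1000000).map(i => (i, i.toString))

    // Store the RDD as serialized bytes: fewer JVM objects on the heap,
    // hence less GC pressure, at the cost of deserialization on access.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)
    println(rdd.count())

    // Caching the equivalent Dataset uses Spark SQL's columnar in-memory
    // format rather than one JVM object per record.
    val ds = rdd.toDS()
    ds.cache()
    println(ds.count())

    spark.stop()
  }
}
```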

v. Handling structured data: RDDs do not provide a schema view of the data and have no special provision for handling structured data.

Dataset and DataFrame, by contrast, provide a schema view of the data: a distributed collection organized into named columns.
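A short sketch (with a made-up Employee case class) of the schema view the higher-level APIs add on top of the same data:

```scala
import org.apache.spark.sql.SparkSession

object SchemaSketch {
  // Hypothetical record type, for illustration only.
  case class Employee(name: String, dept: String, salary: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val employees = Seq(Employee("Alice", "eng", 100000.0), Employee("Bob", "ops", 90000.0))

    // An RDD is just a distributed collection of objects: no schema attached.
    val rdd = spark.sparkContext.parallelize(employees)
    println(rdd.count())

    // The Dataset view carries named, typed columns.
    val ds = employees.toDS()
    ds.printSchema()
    ds.groupBy($"dept").avg("salary").show()

    spark.stop()
  }
}
```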

These limitations of RDDs in Apache Spark are the reason DataFrame and Dataset were introduced.

vaquar khan