Is there any industry guideline on choosing between `RDD` and `Dataset` for a Spark project?
So far, what's obvious to me (see the sketch below):

- `RDD`: more type safety, less optimization (in the Spark SQL sense)
- `Dataset`: less type safety, more optimization
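To make the trade-off concrete, here is a minimal sketch (hypothetical `Person` type and data) of the same filter written against both APIs, assuming a local Spark 2.x+ session with `spark-sql` on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, defined at the top level so Spark can derive an Encoder for it.
case class Person(name: String, age: Int)

object RddVsDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Ann", 30), Person("Bob", 17))

    // RDD: transformations are plain Scala functions, fully checked by the
    // compiler, but opaque to Spark SQL's Catalyst optimizer.
    val adultNamesRdd = spark.sparkContext
      .parallelize(people)
      .filter(_.age >= 18) // misspelling `age` here would not compile
      .map(_.name)

    // Dataset: the same logic goes through Catalyst and the Tungsten encoders,
    // but a column reference like $"age" is only resolved at runtime, so a typo
    // fails late instead of at compile time.
    val adultNamesDs = people.toDS()
      .filter($"age" >= 18)
      .select($"name")
      .as[String]

    adultNamesRdd.collect().foreach(println)
    adultNamesDs.collect().foreach(println)

    spark.stop()
  }
}
```

That is roughly what I mean by "more type safety" versus "more optimization" above.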
Which one is recommended for production code? Surprisingly, I haven't found this topic on Stack Overflow so far, even though Spark has been prevalent for the past few years.
I can already foresee that the majority of the community sides with `Dataset` :), so let me first quote a point against it from this answer (and please do share opinions to the contrary):
> Personally, I find statically typed `Dataset` to be the least useful:
>
> - They don't provide the same range of optimizations as `Dataset[Row]` (although they share the storage format and some execution plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the `DataFrame`.
> - They are not as flexible as RDDs, with only a small subset of types supported natively.
> - "Type safety" with `Encoders` is disputable when a `Dataset` is converted using the `as` method. Because the data shape is not encoded using a signature, a compiler can only verify the existence of an `Encoder`.