Dataset gives better performance than DataFrame. Dataset provides Encoders and type safety, but DataFrame is still in use. Is there any particular scenario where only DataFrame is used, or is there any function that works on DataFrame but does not work on Dataset?
-
This is a good point of view, but sadly there is still too much Spark functionality that is built with the DataFrame as the main API, like Spark ML. Take a look at https://typelevel.org/frameless/. – Emiliano Martinez Jan 03 '19 at 10:26
-
I don't know why people mark this as a duplicate without understanding what I am asking. @user6910411 I didn't ask about the difference between DataFrame and Dataset. – C Kondaiah Jan 03 '19 at 17:35
-
@EmiCareOfCell44 I don't know about MLlib... isn't Dataset available in Spark ML? – C Kondaiah Jan 03 '19 at 17:40
-
Take a look at the Spark ML stages, like transformers and estimators. All of them work with the DataFrame type, Dataset[Row]. And if you go with custom transformers or other advanced features, it's not trivial to abstract over them. – Emiliano Martinez Jan 03 '19 at 19:33
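A minimal sketch of that point, with made-up data and column names: a Spark ML stage such as Tokenizer accepts a Dataset[_] but hands back an untyped DataFrame (Dataset[Row]).

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: app name, data and column names are illustrative.
val spark = SparkSession.builder().master("local[*]").appName("ml-dataframe-sketch").getOrCreate()
import spark.implicits._

val docs: DataFrame = Seq((0, "spark datasets vs dataframes")).toDF("id", "text")

// Spark ML stages are written against the untyped API:
// Tokenizer.transform takes a Dataset[_] but returns a DataFrame (Dataset[Row]).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized: DataFrame = tokenizer.transform(docs)
tokenized.show(truncate = false)
```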
1 Answer
A DataFrame is actually a Dataset[Row]. It also has many tools and functions associated with it which enable working with Row as opposed to a generic Dataset[SomeClass]. This gives DataFrame the immediate advantage of being able to use these tools and functions without having to write them yourself.
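As a rough illustration (the data, column names, and values below are made up), the following sketch shows both points: assigning a DataFrame to a Dataset[Row] compiles as-is, and the built-in column functions work on it without any hand-written Row handling.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative sketch; names and values are made up.
val spark = SparkSession.builder().master("local[*]").appName("df-is-ds-row").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
val sameThing: Dataset[Row] = df   // compiles as-is: DataFrame is an alias for Dataset[Row]

// The built-in column functions operate on the untyped Row representation,
// so nothing has to be written by hand for this kind of manipulation.
df.select(upper(col("name")).as("upper_name"))
  .where(col("age") > 26)
  .show()
```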
DataFrame actually enjoys better performance than Dataset. The reason for this is that Spark can understand the internals of the built-in functions associated with DataFrame, and this enables the Catalyst optimizations (rearranging and changing the execution tree) as well as whole-stage code generation, which avoids a lot of the virtualization overhead.
Furthermore, when writing Dataset functions, the relevant object type (e.g. a case class) needs to be constructed (which includes copying). This can be an overhead depending on the usage.
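A hedged sketch of that difference, continuing the illustrative df and SparkSession from the snippet above (in a compiled application the case class should be defined at the top level so Spark can derive an Encoder for it):

```scala
case class Person(name: String, age: Int)

val people: Dataset[Person] = df.as[Person]

// Typed version: the lambda is a black box to Catalyst, and every Row has to be
// deserialized into a Person object (and copied) before the function runs.
val typed: Dataset[Person] = people.map(p => p.copy(age = p.age + 1))

// Untyped version: the expression is visible to the optimizer, benefits from
// whole-stage codegen, and stays in Spark's internal binary format.
val untyped: DataFrame = df.withColumn("age", col("age") + 1)
```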
Another advantage of DataFrame is that its schema is set at run time rather than at compile time. This means that if you read, for example, from a parquet file, the schema is determined by the content of the file. This makes it possible to handle dynamic cases (e.g. generic ETL jobs).
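For example, a minimal sketch of such a dynamic job; the input and output paths are hypothetical placeholders, and the "drop all-null columns" step is just one illustrative generic transformation:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: paths are hypothetical placeholders.
val spark = SparkSession.builder().master("local[*]").appName("runtime-schema").getOrCreate()

// The schema is read from the parquet metadata at runtime; nothing about the
// columns has to be known when this job is compiled.
val df = spark.read.parquet("/path/to/input")
df.printSchema()

// A generic ETL step that works for whatever schema shows up:
// keep only columns that contain at least one non-null value, then write out.
val nonEmptyCols = df.columns.filter(c => df.where(df(c).isNotNull).limit(1).count() > 0)
df.select(nonEmptyCols.map(df(_)): _*)
  .write.mode("overwrite").parquet("/path/to/output")
```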
There are probably more reasons and advantages but I think those are the important ones.

-
In case you use HDFS (parquet..) you have the schema, but if you don't, you must include it. And having the schema at runtime leads to runtime errors that you cannot detect at compile time; I don't think that is any kind of advantage. – Emiliano Martinez Jan 03 '19 at 11:13
-
@EmiCareOfCell44 ETL is a standard use for Spark. You do not necessarily know the schema. This is also true when you have additional, extended fields. Because you can't really have "AnyValue" or an abstract class as a member, you would have problems with any but the strictest schema definitions. There are more use cases for this than I can count... – Assaf Mendelson Jan 03 '19 at 11:18
-
Problems can arise indeed. But I would prefer having the compile phase help me with these schema changes rather than having Spark detect the errors at runtime in some Spark SQL function call. There is much to do in Spark in this field. – Emiliano Martinez Jan 03 '19 at 11:54