Dataset gives better performance than DataFrame. Dataset provides Encoders and type safety, but DataFrame is still in use. Is there any particular scenario where only DataFrame is used, or is there any function that works on DataFrame but does not work on Dataset?
-
This is a good point of view, but sadly there is still too much Spark functionality that is built with the DataFrame as the main API, like Spark ML. Take a look at https://typelevel.org/frameless/. – Emiliano Martinez Jan 03 '19 at 10:26
-
I don't know why people mark this as a duplicate without understanding what I am asking. @user6910411 I didn't ask about the difference between DataFrame and Dataset. – C Kondaiah Jan 03 '19 at 17:35
-
@EmiCareOfCell44 I don't know about MLlib... isn't Dataset available in Spark ML? – C Kondaiah Jan 03 '19 at 17:40
-
Take a look at the Spark ML stages, like transformers and estimators. All of them work with the DataFrame type, Dataset[Row]. And if you go with custom transformers or other advanced features, it's not trivial to abstract over them. – Emiliano Martinez Jan 03 '19 at 19:33
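A minimal sketch of that point, with made-up data and column names: a Spark ML stage such as Tokenizer accepts a Dataset[_] but hands back an untyped DataFrame (Dataset[Row]).

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: app name, data and column names are illustrative.
val spark = SparkSession.builder().master("local[*]").appName("ml-dataframe-sketch").getOrCreate()
import spark.implicits._

val docs: DataFrame = Seq((0, "spark datasets vs dataframes")).toDF("id", "text")

// Spark ML stages are written against the untyped API:
// Tokenizer.transform takes a Dataset[_] but returns a DataFrame (Dataset[Row]).
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized: DataFrame = tokenizer.transform(docs)
tokenized.show(truncate = false)
```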
1 Answer
A DataFrame is actually a Dataset[Row]. It also has many tools and functions associated with it which enable working with Row as opposed to a generic Dataset[SomeClass]. This gives DataFrame the immediate advantage of being able to use these tools and functions without having to write them yourself.
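As a rough illustration (the data, column names, and values below are made up), the following sketch shows both points: assigning a DataFrame to a Dataset[Row] compiles as-is, and the built-in column functions work on it without any hand-written Row handling.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative sketch; names and values are made up.
val spark = SparkSession.builder().master("local[*]").appName("df-is-ds-row").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")
val sameThing: Dataset[Row] = df   // compiles as-is: DataFrame is an alias for Dataset[Row]

// The built-in column functions operate on the untyped Row representation,
// so nothing has to be written by hand for this kind of manipulation.
df.select(upper(col("name")).as("upper_name"))
  .where(col("age") > 26)
  .show()
```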
DataFrame actually enjoys better performance than Dataset. The reason for this is that Spark can understand the internals of the built-in functions associated with DataFrame, and this enables the Catalyst optimizations (rearranging and changing the execution tree) as well as whole-stage code generation, which avoids a lot of the virtualization overhead.
Furthermore, when writing Dataset functions, the relevant object type (e.g. a case class) needs to be constructed (which includes copying). This can be an overhead depending on the usage.
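A hedged sketch of that difference, continuing the illustrative df and SparkSession from the snippet above (in a compiled application the case class should be defined at the top level so Spark can derive an Encoder for it):

```scala
case class Person(name: String, age: Int)

val people: Dataset[Person] = df.as[Person]

// Typed version: the lambda is a black box to Catalyst, and every Row has to be
// deserialized into a Person object (and copied) before the function runs.
val typed: Dataset[Person] = people.map(p => p.copy(age = p.age + 1))

// Untyped version: the expression is visible to the optimizer, benefits from
// whole-stage codegen, and stays in Spark's internal binary format.
val untyped: DataFrame = df.withColumn("age", col("age") + 1)
```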
Another advantage of DataFrame is that its schema is set at run time rather than at compile time. This means that if you read, for example, from a parquet file, the schema is determined by the content of the file. This makes it possible to handle dynamic cases (e.g. generic ETL jobs).
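For example, a minimal sketch of such a dynamic job; the input and output paths are hypothetical placeholders, and the "drop all-null columns" step is just one illustrative generic transformation:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: paths are hypothetical placeholders.
val spark = SparkSession.builder().master("local[*]").appName("runtime-schema").getOrCreate()

// The schema is read from the parquet metadata at runtime; nothing about the
// columns has to be known when this job is compiled.
val df = spark.read.parquet("/path/to/input")
df.printSchema()

// A generic ETL step that works for whatever schema shows up:
// keep only columns that contain at least one non-null value, then write out.
val nonEmptyCols = df.columns.filter(c => df.where(df(c).isNotNull).limit(1).count() > 0)
df.select(nonEmptyCols.map(df(_)): _*)
  .write.mode("overwrite").parquet("/path/to/output")
```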
There are probably more reasons and advantages but I think those are the important ones.

-
In case you use HDFS (parquet..) you have the schema, but if you don't, you must include it. And having the schema at runtime leads to runtime errors that you cannot detect at compile time; I don't think that is any kind of advantage. – Emiliano Martinez Jan 03 '19 at 11:13
-
@EmiCareOfCell44 ETL is a standard use for Spark. You do not necessarily know the schema. This is also true when you have additional, extended fields. Because you can't really have "AnyValue" or an abstract class as a member, you would have problems with any but the strictest schema definitions. There are more use cases for this than I can count... – Assaf Mendelson Jan 03 '19 at 11:18
-
Problems can arise indeed. But I would prefer having the compile phase help me with these schema changes rather than having Spark detect the errors at runtime in some Spark SQL function call. There is much to do in Spark in this field. – Emiliano Martinez Jan 03 '19 at 11:54