Spark Dataset or Dataframe for Aggregation

Question

We have a MapR cluster running Spark 2.0. We are trying to measure the performance difference between a Hive query that currently runs on the Tez engine and the same query run through Spark SQL, simply by writing the SQL in an .hql file and calling it from a shell script.

The query contains a lot of joins, which will definitely create multiple stages and cause shuffling. What would be the best choice in this scenario?

Is it true that Datasets in Spark are slower than DataFrames for aggregations such as groupBy, max, min, count, etc.?

In which areas do DataFrames perform better than Datasets, and vice versa?

score 0 · Answer 1 · answered Oct 17 '17 at 20:24

In Spark 2.0, DataFrame is an alias for Dataset[Row], so there should not be any performance difference between the two.
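The alias relationship can be checked directly in Scala. The following is a minimal sketch, assuming a local Spark 2.x session; the `Sale` case class, its sample rows, and the object name are hypothetical and exist only for illustration. It also prints the physical plans for an untyped aggregation and for a typed, lambda-based one, since typed operations that take Scala functions are opaque to the Catalyst optimizer and may be planned differently.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{count, max, min}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, amount: Double)

object AliasSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect types and plans.
    val spark = SparkSession.builder()
      .appName("alias-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val typed: Dataset[Sale] =
      Seq(Sale("east", 10.0), Sale("east", 25.0), Sale("west", 7.5)).toDS()

    // DataFrame and Dataset[Row] are the same type, so these
    // assignments compile without any conversion step.
    val df: DataFrame = typed.toDF()
    val sameThing: Dataset[Row] = df

    // Untyped aggregation: column expressions that the optimizer can analyze.
    val untypedAgg = sameThing
      .groupBy("region")
      .agg(max("amount"), min("amount"), count("amount"))
    untypedAgg.explain()
    untypedAgg.show()

    // Typed aggregation: the lambdas are plain Scala functions, so Spark
    // has to deserialize rows into Sale objects before calling them.
    val typedAgg = typed
      .groupByKey(_.region)
      .mapGroups((region, rows) => (region, rows.map(_.amount).max))
    typedAgg.explain()

    spark.stop()
  }
}
```

Comparing the two `explain()` outputs on your own data is a reasonable way to see whether the typed path adds work for a given query; a query submitted as SQL text goes through the untyped path.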

Please see:

Paul Leclercq

So does that mean Spark Datasets and DataFrames are very similar in performance in all aspects? – AJm Oct 17 '17 at 21:22
@Aijaz Yes, Dataset = DataFrame + type safety. – Paul Leclercq Oct 18 '17 at 13:20
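
The "type safety" part of that comment can be illustrated with a short sketch, again assuming a local Spark 2.x session; the `Order` case class and the misspelled column name `amout` are hypothetical. With the typed API a wrong field name is rejected by the Scala compiler, while with the untyped API it only surfaces when Spark analyzes the query at run time.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Order(id: Long, amount: Double)

object TypeSafetySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("type-safety-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Order(1L, 10.0), Order(2L, 7.5)).toDS()
    val df = ds.toDF()

    // Typed API: field access is checked by the Scala compiler.
    // Writing ds.map(_.amout) would fail to compile, because Order
    // has no field named `amout`.
    val amounts = ds.map(_.amount)
    amounts.show()

    // Untyped API: the same typo compiles fine and is only reported
    // when Spark resolves the column name against the schema.
    try {
      df.select("amout").show()
    } catch {
      case e: AnalysisException =>
        println(s"Caught at run time: ${e.getMessage}")
    }

    spark.stop()
  }
}
```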
