We have a MapR cluster running Spark 2.0. We are trying to measure the performance difference between a Hive query that currently runs on the Tez engine and the same query run through spark-sql, simply by writing the SQL in an .hql file and calling it from a shell script.

The query contains many joins, which will certainly create multiple stages and cause shuffling. In this scenario, what would be the optimal choice?

Is it true that Datasets in Spark are slower than DataFrames for aggregations such as groupBy, max, min, count, etc.?

In which areas do DataFrames perform better than Datasets, and vice versa?

AJm

1 Answer

In Spark 2.0, DataFrame is a type alias for Dataset[Row], so there should be no performance difference between the two.
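As a rough illustration, here is a minimal sketch assuming the Spark 2.x Scala API, where `DataFrame` is declared as `type DataFrame = Dataset[Row]`. The `Sale` case class, the sample rows, and the `AliasSketch` object are hypothetical and exist only for this example. It shows that a `DataFrame` can be assigned to a `Dataset[Row]` without conversion, and it prints the physical plans of an untyped column-based aggregation and a typed lambda-based one so they can be compared.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{count, max, min}

// Hypothetical record type, used only for this illustration.
case class Sale(region: String, amount: Double)

object AliasSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster the master would come from spark-submit.
    val spark = SparkSession.builder()
      .appName("alias-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val typed: Dataset[Sale] =
      Seq(Sale("east", 10.0), Sale("east", 5.0), Sale("west", 7.0)).toDS()

    val untyped: DataFrame = typed.toDF()

    // Compiles with no conversion because DataFrame and Dataset[Row] are the same type.
    val sameThing: Dataset[Row] = untyped

    // Untyped, column-based aggregation: expressed entirely as Catalyst expressions.
    sameThing
      .groupBy("region")
      .agg(max("amount"), min("amount"), count("*"))
      .explain()

    // Typed aggregation with Scala lambdas: the functions are opaque to the optimizer,
    // and rows are deserialized into Sale objects before the lambdas run.
    typed
      .groupByKey(_.region)
      .mapGroups((region: String, rows: Iterator[Sale]) => (region, rows.map(_.amount).max))
      .explain()

    spark.stop()
  }
}
```

Comparing the two `explain()` outputs is one way to see where a difference could come from: any slowdown would be tied to typed, lambda-based operations on a `Dataset[T]` rather than to `Dataset[Row]` itself. A query submitted as SQL text through spark-sql does not involve the typed API at all.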

Please see:

Paul Leclercq