
After reading *What is Hive? Is it a database?*, a colleague mentioned yesterday that he was able to filter a 15B-row table, join it with another table after doing a "group by" (which resulted in 6B records), in only 10 minutes! I wonder whether this would be slower in Spark; now with DataFrames they may be comparable, but I am not sure, hence the question.

Is Hive faster than Spark? Or does this question have no meaning? Sorry for my ignorance.

He uses the latest Hive, which seems to be using Tez.

gsamaras
    Put them on equivalent hardware and run comparable workloads. You'll know the answer. :) – Sergio Tulentsev Sep 09 '16 at 16:40
  • Correct @SergioTulentsev, but wouldn't that be data-specific? I mean, what I am trying to ask here is something like [is Spark faster than Hadoop?](http://stackoverflow.com/questions/32572529/why-is-spark-faster-than-hadoop-map-reduce).. Because even if I did the experiment, I still wouldn't know why. I am trying to understand **theoretically** what would happen.. :) – gsamaras Sep 09 '16 at 16:43
    Facebook has successfully ported a massive batch job from Hive to Spark. It took them **several months of debugging** (and 13 Spark JIRAs) **and tuning**. But now their job runs much faster. Are you up to the challenge?? https://code.facebook.com/posts/1671373793181703/apache-spark-scale-a-60-tb-production-use-case/ – Samson Scharfrichter Sep 09 '16 at 17:03
  • IBM tried to run a TPC-DS benchmark with Spark 2.0 at scale. But in the end they had to tweak a lot of configuration properties, both documented and undocumented, to make it through. Are you up to the challenge?? http://www.slideshare.net/jcmia1/apache-spark-20-tuning-guide/2 – Samson Scharfrichter Sep 09 '16 at 17:06
  • @SamsonScharfrichter there are some *really* cool links, thank you! I feel what the first says, when I tried to scale a pipeline we had to 15T. Thank you! – gsamaras Sep 09 '16 at 17:09
  • Sorry to add to your confusion, but you can run Hive on top of Spark as well (aka, use Spark as data processing engine for your queries). That approach will yield query latency in the same ballpark as that of Hive-on-Tez (while offering the opportunity to consolidate all your data processing onto the Spark API). Generally speaking, Hive and Spark SQL are intended for two different things and IMO they shouldn't be compared on a "performance" bases. – Justin Kestelyn Sep 09 '16 at 22:06
  • @JustinKestelyn you did the right thing to comment, thank you, I see your point, makes sense! :) – gsamaras Sep 09 '16 at 22:11

3 Answers


Hive is just a framework that provides SQL functionality for MapReduce-style workloads.

These workloads can run on MapReduce or Tez.

So the real comparison is Hive on Tez vs. Hive on Spark. A nice article discussing this: *When to go with ETL on Hive using Tez vs. when to go with Spark ETL?* (gist: use Hive on Spark if unsure).
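To make the distinction concrete: in Hive the execution engine is just a session-level setting, and the same HiveQL runs unchanged on either backend. A minimal sketch (the query and the `events` table are hypothetical; the configuration property is the standard one):

```sql
-- The same HiveQL query can be dispatched to different execution
-- engines via a single session property (mr, tez, or spark).
SET hive.execution.engine=tez;
SELECT year, COUNT(*) FROM events GROUP BY year;   -- runs as a Tez DAG

SET hive.execution.engine=spark;
SELECT year, COUNT(*) FROM events GROUP BY year;   -- runs as Spark jobs
```

This is why "Hive vs. Spark" is ambiguous: the SQL layer and the execution engine can be mixed and matched.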

Benchmark information (chart omitted; lower is better):

Krishna Kalyan
  • Krishna thank you very much. Stack Overflow appreciates links, but sometimes these links die and future users can't be helped. Would you be so kind as to update your answer with the *gist/intuition/basic idea* of the article? :) – gsamaras Sep 09 '16 at 16:54
  • @gsamaras thanks for the feedback. I will edit this answer. – Krishna Kalyan Sep 09 '16 at 17:08
  • Chart needs to be updated, as we now have Spark 2.0 with a lot of optimizations - some queries run about 100x faster, most queries about 10x faster than in Spark 1.x :) – T. Gawęda Sep 09 '16 at 17:39
  • @T.Gawęda good point! Should you find something better, please post an answer! :) – gsamaras Sep 09 '16 at 17:57
    @gsamaras Yes I will write longer answer with focus on how Spark supports Hive, but tomorrow - in Poland there is a night now ;) – T. Gawęda Sep 09 '16 at 18:06
  • Can you change the line from *can run in mapreduce or yarn* to *can run on mapreduce or tez* – Madhusoodan P Nov 05 '18 at 12:13

Spark is convenient, but it does not handle scale all that well when it comes to SQL performance.

Hive has amazing support for co-partitioned joins. When the tables you are joining have hundreds of millions to billions of rows, you will really appreciate the fine-grained join support via:

  • similar distribute by and sort by (or cluster by)
  • bucketed joins
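As a sketch of what this looks like in practice (the table names, columns, and bucket count are hypothetical; the `CLUSTERED BY ... SORTED BY` DDL and the session properties are standard Hive): bucketing both tables on the join key lets Hive perform a sort-merge-bucket (SMB) join instead of a full shuffle.

```sql
-- Both tables bucketed and sorted on the join key, with matching
-- bucket counts, so co-located buckets can be merge-joined directly.
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 256 BUCKETS;

CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 256 BUCKETS;

-- Enable bucketed / sort-merge-bucket joins for the session.
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;

SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

At billions of rows, avoiding the shuffle is exactly where this fine-grained control pays off.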

Hive has extensive support for metadata-only queries; Spark has only had a glimmer of this since 2.1.
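For instance, when table and column statistics are kept up to date, Hive can answer simple aggregates from the metastore without scanning any data files. A sketch (the `orders` table is hypothetical; the configuration property is the standard one):

```sql
-- With up-to-date statistics, these aggregates are answered from
-- the metastore alone, with no data scan.
SET hive.compute.query.using.stats = true;

SELECT COUNT(*) FROM orders;          -- from table stats
SELECT MAX(customer_id) FROM orders;  -- from column stats
```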

Spark runs out of steam quickly when the number of partitions exceeds roughly 10K. Hive does not suffer from this limitation.

WestCoastProjects

Fast forward to 2018: Hive is much faster (and more stable) than SparkSQL, especially in concurrent environments, according to the following article:

https://mr3.postech.ac.kr/blog/2018/10/31/performance-evaluation-0.4/

The article compares several SQL-on-Hadoop systems using the TPC-DS benchmark (1TB, 3TB, and 10TB) on three clusters (11 nodes, 21 nodes, and 42 nodes):

  • Hive-LLAP included in HDP(Hortonworks Data Platform) 2.6.4
  • Hive-LLAP included in HDP 3.0.1
  • Presto 0.203e (with cost-based optimization enabled)
  • Presto 0.208e (with cost-based optimization enabled)
  • SparkSQL 2.2.0 included in HDP 2.6.4
  • SparkSQL 2.3.1 included in HDP 3.0.1
  • Hive 3.1.0 running on top of Tez
  • Hive on Tez included in HDP 3.0.1
  • Hive 3.1.0 running on top of MR3 0.4
  • Hive 2.3.3 running on top of MR3 0.4

So, in comparison with the Hive-based systems and Presto, SparkSQL is very slow and does not scale in concurrent environments. (Note that the experiment uses SparkSQL running on vanilla Spark.)

glapark
  • I don't have an installation to check that now, so I can't say more, but others might find that useful. – gsamaras Nov 02 '18 at 10:25