9

I am new to Hadoop Hive and I am developing a reporting solution. The problem is that the query performance is really slow (hive 0.10, hbase 0.94, hadoop 1.1.1). One of the queries is:

select a.*, b.country, b.city from p_country_town_hotel b 
    inner join p_hotel_rev_agg_period a  on
    (a.key.hotel = b.hotel) where b.hotel = 'AdriaPraha' and a.min_date < '20130701'
    order by a.min_date desc  
    limit 10;

which takes quite a long time (50s). I know I know, the join is on string field and not on integer but the data sets are not big(cca 3300 and 100000 records). I tried hints on this SQL but that didn't turn out any faster. The same query on MS SQL Server lasts 1s. Also a simple count(*) from table lasts 7-8s which is shocking (the table has 3300 records). I really don't know what is the issue? Any ideas or did I misinterpret Hadoop?

Undo
  • 25,519
  • 37
  • 106
  • 129
user2346868
  • 103
  • 1
  • 2
  • 6

4 Answers4

17

Yes..you have misinterpreted Hadoop. Hadoop, and Hive as well, are not meant for real time stuff. They are most suitable for offline, batch processing kinda stuff. They are not at all a replacement to RDBMSs. Though you can do some fine tuning but 'absolute real time' is not possible. There a lot of things which happen under the hood when you run a hive query, which I think you are not unaware of. First of all you Hive query gets converted into a corresponding MR job followed by few other things like split creation, records generation, mapper generation etc. I would never suggest Hadoop(or Hive) if you have real time needs.

You might wanna have a look at Impala for your real time needs.

Tariq
  • 34,076
  • 8
  • 57
  • 79
  • Do not use impala for analysis of large data. check [this](https://gigaom.com/2014/01/13/cloudera-says-impala-is-faster-than-hive-which-isnt-saying-much/). Even current version (0.15) is much faster than what is using in this blog. Check [this blog](http://hortonworks.com/blog/benchmarking-apache-hive-13-enterprise-hadoop/). – chanchal1987 Nov 19 '15 at 21:30
3

Hive is not the appropriate tool for a real-time job, but if you want to leverage the Hadoop infrastructure with real-time or fast data access take a look at HBase. It's value-add is all about fast access. Not sure why you are selecting Hadoop for your solution, but Hbase sits on top of HDFS which some people like because of the inherent redundancy HDFS offers (you copy a file on there once and it is auto-replicated) which may be one of the reasons you are looking into Hadoop.

For more info: read this question

Community
  • 1
  • 1
Engineiro
  • 1,146
  • 7
  • 10
  • First of all, thank you guys for helping me out :). Currently this is a proof of concept so we have a small amount of data. But in real life we'll have tens of millions of rows and that is why we chose to try out Hadoop Hive. And also it's free :). For the reporting purposes I **need** fast reports (ideally it would be couple of seconds) so I would like to use something fast (I'll take a look at this Impala). That is why I was surprised at the speed of Hadoop Hive because I expected it had to be fast as it is intended for big data.. – user2346868 May 06 '13 at 07:48
1

I am not sure how new you are to hadoop.Hive does not give you results at interactive speeds how small the tables are.In case you knew this already and trying to tune the query, you can try below:

select a.*, b.country, b.city from
 (select * from p_country_town_hotel where hotel= 'AdriaPraha') b 
 inner join 
 (select * from p_hotel_rev_agg_period where min_date < '20130701') a  
 on
 a.key.hotel = b.hotel
 order by a.min_date desc  
 limit 10;

If you know one of the tables is small enough to fit in memory, you can try map side join.

1

use http://phoenix.apache.org/ for real-time queries like this

alex
  • 1,757
  • 4
  • 21
  • 32