1

i have a query like the below. This query runs in 15 seconds on Impala, but when I run the same on HIVE, it takes 10+ minutes. I have to join several other tables to this query (with similar joins as the below) and the total time it takes is more than 1 hour(sometimes it fails/gets stuck after an hour), but on Impala it runs within a minute.

Can you please tell me why this might be happening and how I might be able to optimize the below join on hive?

SELECT count(*)
FROM table_A A
LEFT JOIN table_B B ON cast(A.value AS decimal(5, 2)) BETWEEN B.fromvalue AND B.tovalue
AND A.date BETWEEN B.fromdate AND B.todate ;
Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
Sawan S
  • 87
  • 8
  • 1
    Does this answer your question? [How does impala provide faster query response compared to hive](https://stackoverflow.com/questions/16755599/how-does-impala-provide-faster-query-response-compared-to-hive) – Steven Aug 11 '21 at 15:18

2 Answers2

1

Hive is running basic map reduce tasks (which are very slow by nature).
Also, joins work best with equi-joins (colA = colB) and you are not doing any equi-joins. (between is colA >= colB and colA <= colC)

One of the main feature of Impala is to be really fast when reading data.

So basically, yes, Hive is slow compare to Impala, and there is not much you can do about it. That's just how it is.


Also, you are doing a count over a left join which means that, if there is not duplicates, the output will be the number of line in A. So maybe, you do not need the join ...

Steven
  • 14,048
  • 6
  • 38
  • 73
  • Thanks for your response. Yes, i need to have non-equi joins in my query. And i am doing a select * on the table and not count. I just put count here just as an example. Is there any way this query can be run faster on pyspark spark.sql? – Sawan S Aug 11 '21 at 19:07
  • @SawanS Of course it should be faster with spark. But spark also likes equi-joins. Just try it to see the result, we cannot do it for you, we don't have your data. – Steven Aug 11 '21 at 20:41
1

Check query plan and try to configure mapjoin.

Theta joins (non-equality joins) like yours are implemented using cross join + filter in Hive. In case of Map join it will work much faster.

See here how to configure map-join: https://stackoverflow.com/a/49154414/2700344.

Check query plan again and make sure the MapJoinOperator is used.

Even with mapjoin, Hive is slower than Impala, though can process bigger datasets.

leftjoin
  • 36,950
  • 8
  • 57
  • 116