
I'm trying to (inner) join two tables on HDFS partitioned by 'day' (date) for multiple days (say 2 weeks). Both tables have 100s of columns, but I'm only trying to query 10s of them. Each day has more than a billion rows.

My Hive query looks like the following.

INSERT OVERWRITE TABLE join1 partition (day) 
SELECT a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day 
from (
    select x1, x2, x3... xn 
    from table1 
    where day between day1 and day2
) a 
join (
    select x1, y1, y2,... ym, day 
    from table2 where day between day1 and day2
) b 
on a.x1=b.x1;

First problem: it takes a very long time (12+ hours) to do this join, even for a smaller period (1-7 days).

Second problem: it fails every time I try to run it for more than 10 days or so. It uses around 504 mappers and 250 reducers, which is the default (I've also tried with 500 reducers).

Failure Info

I know this error is not the real one (see What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask), but even the real error wasn't very useful (sorry, I can't retrieve it now).

What could be the reason for these failures? Can anyone suggest a better way to join such huge tables?

– FreeSid91
  • Try running your query with the Tez execution engine. Also, adding `-hiveconf hive.root.logger=DEBUG,console` can give you more details about the error. – Ambrish Oct 01 '16 at 14:06
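For reference, a minimal sketch of how to apply the commenter's suggestions (this assumes Tez is actually deployed on the cluster; `hive.execution.engine` and `hive.root.logger` are standard Hive properties):

-- switch this session from MapReduce to Tez (only works if Tez is installed on the cluster)
SET hive.execution.engine=tez;

-- to see more detail on the failure, launch the CLI with debug logging:
--   hive -hiveconf hive.root.logger=DEBUG,console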

2 Answers


This is too long for a comment.

Some databases have problems optimizing subqueries, and I could imagine this being a problem with Hive. So, I would recommend:

select a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day 
from table1 a join
     table2 b 
     on a.x1 = b.x1
where a.day between day1 and day2 and
      b.day between day1 and day2;

I also wonder if you want a condition a.day = b.day in the on clause. Using the existing partitioning key in the join should help performance.
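For illustration, a sketch of that variant; `day1` and `day2` are the same placeholder boundaries as in the question, and the `a.day = b.day` condition assumes matching rows always come from the same day:

select a.x1, a.x2, a.x3... a.xn, b.y1, b.y2.... b.ym, b.day
from table1 a join
     table2 b
     on a.x1 = b.x1 and a.day = b.day   -- join on the partition key as well
where a.day between day1 and day2;      -- b.day is implied by the equality above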

– Gordon Linoff

About the error:

Since you are using dynamic partitioning on join1, did you correctly set the maximum number of partitions that can be created?
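For example (a hedged sketch; these are the standard dynamic-partition properties, but the numbers are only illustrative and should be sized to the number of day partitions you expect to write):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;    -- allow the partition column to be fully dynamic
SET hive.exec.max.dynamic.partitions=1000;         -- max partitions per statement
SET hive.exec.max.dynamic.partitions.pernode=100;  -- max partitions per mapper/reducer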

About the speed:

Are your table1 and table2 defined like this?

CREATE TABLE table1 ( 
  x1 string,
  x2 string,
  :
) 
PARTITIONED BY ( day int ) 
CLUSTERED BY ( x1 ) 
SORTED BY ( x1 ) INTO 400 BUCKETS;

This table is partitioned by day, so accessing any given day reads only the corresponding partition instead of the whole dataset, which will speed up your inner queries. It also uses bucketing: when you join on x1, all rows with the same x1 value are stored together in the same bucket. This speeds up the join, but don't expect a huge difference from bucketing alone; the gain only becomes visible when the join can be performed at the map stage (thanks to the bucketing).
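As a sketch of the session settings that let Hive actually exploit the bucketing at the map stage (assuming both tables are bucketed and sorted on x1 into the same number of buckets; verify the properties against your Hive version):

SET hive.enforce.bucketing=true;                   -- when populating the bucketed tables
SET hive.optimize.bucketmapjoin=true;              -- allow a bucket map join
SET hive.optimize.bucketmapjoin.sortedmerge=true;  -- allow a sort-merge bucket map join
SET hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;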

– ozw1z5rd