I'm trying to (inner) join two tables on HDFS, each partitioned by 'day' (a date), over a range of multiple days (say 2 weeks). Both tables have hundreds of columns, but I only need to query tens of them. Each day has more than a billion rows.
My Hive query looks like the following.
INSERT OVERWRITE TABLE join1 PARTITION (day)
SELECT a.x1, a.x2, a.x3, ... a.xn, b.y1, b.y2, ... b.ym, b.day
FROM (
    SELECT x1, x2, x3, ... xn
    FROM table1
    WHERE day BETWEEN day1 AND day2
) a
JOIN (
    SELECT x1, y1, y2, ... ym, day
    FROM table2
    WHERE day BETWEEN day1 AND day2
) b
ON a.x1 = b.x1;
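Since the insert writes to a dynamic partition, I have dynamic partitioning enabled in the session, roughly like this (a sketch of my settings; these are the standard Hive properties, though I'm not certain the exact values matter here):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;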
First problem: the join takes a really long time (12+ hours) even for a smaller period (1-7 days).
Second problem: it fails every time I try to run it for more than 10 days or so. The job uses around 504 mappers and 250 reducers, which is the default (I've also tried forcing 500 reducers).
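For reference, this is roughly how I overrode the reducer count when I tried 500 reducers (the property name may differ by Hadoop/Hive version; mapreduce.job.reduces is the newer equivalent):

set mapred.reduce.tasks=500;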
I know the reported error is not the real one (see What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask), but even the underlying error wasn't very useful (sorry, I can't retrieve it right now).
What could be causing these failures? Can anyone suggest a better way to join such huge tables?