
I am getting this error while performing a simple join between two tables. I run the query in the Hive command line. I'll call the tables a and b. Table a is a Hive internal table and b is an external table (in Cassandra). Table a has only 1,610 rows and table b has ~8 million rows. In the actual production scenario, table a could grow to about 100K rows. Shown below is my join, with table b as the last table in the join:

SELECT a.col1, a.col2, b.col3, b.col4 FROM a JOIN b ON (a.col1=b.col1 AND a.col2=b.col2);

Shown below is the error

Total MapReduce jobs = 1
Execution log at: /tmp/pricadmn/.log
2014-04-09 07:15:36 Starting to launch local task to process map join; maximum memory = 932184064
2014-04-09 07:16:41 Processing rows: 200000 Hashtable size: 199999 Memory usage: 197529208 percentage: 0.212
2014-04-09 07:17:12 Processing rows: 300000 Hashtable size: 299999 Memory usage: 163894528 percentage: 0.176
2014-04-09 07:17:43 Processing rows: 400000 Hashtable size: 399999 Memory usage: 347109936 percentage: 0.372
...
...
...

2014-04-09 07:24:29 Processing rows: 1600000 Hashtable size: 1599999 Memory usage: 714454400 percentage: 0.766
2014-04-09 07:25:03 Processing rows: 1700000 Hashtable size: 1699999 Memory usage: 901427928 percentage: 0.967
Execution failed with exit status: 3
Obtaining error information


Task failed!
Task ID:
Stage-5

Logs:

/u/applic/pricadmn/dse-4.0.1/logs/hive/hive.log
FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask

I am using DSE 4.0.1. Here are a few of my settings that might be relevant:
mapred.map.child.java.opts=-Xmx512M
mapred.reduce.child.java.opts=-Xmx512M
mapred.reduce.parallel.copies=20
hive.auto.convert.join=true

I increased mapred.map.child.java.opts to 1G and got past a few more records before the job errored out again, so that doesn't look like a good solution. I also changed the order of the tables in the join, but that didn't help. I saw this link, Hive Map join : out of memory Exception, but it didn't solve my issue.

It looks to me as if Hive is trying to load the bigger table into memory during the local task phase, which confuses me. As per my understanding, the second table (in my case table b) should be streamed rather than held in memory. Correct me if I am wrong. Any help in solving this issue is highly appreciated.
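If it helps, here is my rough understanding of the knobs that govern this local hashtable phase. The property names are standard Hive settings; the values below are just the defaults as I understand them, not values I have verified in my setup:

set hive.auto.convert.join=true;                   -- what I already have: allows Hive to convert the join into a map join
set hive.mapjoin.smalltable.filesize=25000000;     -- size threshold (bytes) under which a table is treated as "small" for map join
set hive.mapjoin.localtask.max.memory.usage=0.90;  -- fraction of the local task heap the hashtable may use before the task aborts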

user3517633

3 Answers

set hive.auto.convert.join = false;
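With automatic conversion disabled, Hive runs this as an ordinary reduce-side (common) join instead of a map join, so no in-memory hashtable is built in a local task. Applied to the query from the question, the session would look roughly like this:

set hive.auto.convert.join=false;
SELECT a.col1, a.col2, b.col3, b.col4 FROM a JOIN b ON (a.col1=b.col1 AND a.col2=b.col2);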
Sahil Nagpal
  • It would be great if you could explain in more detail how this may solve the given problem. – blalasaadri Oct 31 '14 at 10:17
  • Check this page, maybe it helps: [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization) – Andras Toth Mar 26 '15 at 15:27

It appears your task is running out of memory. Check line 324 of the MapredLocalTask class.

 } catch (Throwable e) {
  if (e instanceof OutOfMemoryError
      || (e instanceof HiveException && e.getMessage().equals("RunOutOfMeomoryUsage"))) {
    // Don't create a new object if we are already out of memory
    return 3;
  } else {
Andrew Weaver
  • I am curious why it is throwing an OOM. Table a is very small; instead of putting table a in the hashtable and streaming table b, why is Hive putting the bigger table in memory? My bigger table is the last one in my join statement. Maybe Hive uses some other logic internally. However, I have tried mapred.map.child.java.opts=-Xmx1024M with no luck. Other than increasing the memory, is there any other option? – user3517633 Apr 10 '14 at 19:04
  • There are a myriad of factors that can affect memory use in an MR job. I would increase the heap size incrementally to see if the job will run successfully with a larger, but still reasonably sized heap suited to your hardware. If you can't get away with that, then investigate more into why so much memory is being used to rule out a leak. You can also try using a smaller split size to process smaller chunks of data in each task. – Andrew Weaver Apr 10 '14 at 19:17
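A rough sketch of those two suggestions, using the same MRv1 property names the question already uses plus mapred.max.split.size (the values are purely illustrative, not tested against this cluster):

set mapred.map.child.java.opts=-Xmx1024M;  -- larger, but still modest, map task heap
set mapred.max.split.size=67108864;        -- ~64 MB splits so each task processes a smaller chunk of data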

The last table in the join should be the largest one. You can change the order of the tables in your join.

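If rewriting the query is awkward, Hive's STREAMTABLE hint expresses the same idea explicitly by naming the table to stream. Applied to the query from the question it would look like this (whether it actually helps in this case is untested):

SELECT /*+ STREAMTABLE(b) */ a.col1, a.col2, b.col3, b.col4
FROM a JOIN b ON (a.col1=b.col1 AND a.col2=b.col2);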
alexliu68
  • Table b is my largest table and it is already the last table in my join. If you don't mind, could you rewrite my query? Maybe I am missing something. – user3517633 Apr 10 '14 at 18:32