I'm running a complex query in Hive which, when run, uses a huge amount of local disk space in the /tmp folder and eventually fails with an out-of-space error, because /tmp fills up completely with the query's intermediate map-reduce results. (/tmp is on a separate partition with 100 GB of free space.) While running, it prints:
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Job running in-process (local Hadoop)
As you can see above, Hive is somehow running in local mode. After some research online, I checked a few relevant parameters; the results are below:
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive> set mapred.job.tracker;
mapred.job.tracker=local
hive> set mapred.local.dir;
mapred.local.dir=/tmp/hadoop-hive/mapred/local
So I have two questions regarding this:
- Can this be the reason why the map-reduce jobs are consuming space on the local disk instead of the HDFS /tmp folder, as is typically the case with Pig scripts?
- How can I make Hive run in distributed mode, given the current settings? Please note that I'm using MRv2 in the cluster, but the above options are confusing, as they seem to be relevant to MRv1. I may be wrong here, being a newbie.
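For reference, my understanding is that under MRv2 the property that selects the execution framework is mapreduce.framework.name (possible values: local, classic, yarn), rather than mapred.job.tracker. A minimal sketch of how I would inspect and override it from the Hive prompt is below. The session-level override to yarn is an assumption about what the fix might look like, not something I have confirmed on this cluster, and I have deliberately left out any output:

```
hive> set mapreduce.framework.name;
hive> set yarn.resourcemanager.address;
hive> -- assumption: overriding for the current session only, to test whether jobs then go to YARN
hive> set mapreduce.framework.name=yarn;
```

If the first command reports local or an undefined value, that would seem consistent with the "Job running in-process (local Hadoop)" message above.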
Any help will be much appreciated!