Has anybody had a similar issue in R (build 1060) on top of a sandbox Hadoop (Cloudera 5.1 / Hortonworks 2.1)? It seems to be a problem with the new R/Hadoop combination, because on CDH 5.0 it works.

Code:

Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/lib/hadoop-mapreduce/hadoop-streaming.jar")
Sys.setenv(JAVA_HOME="/usr/java/jdk1.7.0_55-cloudera")
library(rhdfs)
library(rmr2)
hdfs.init()

## space and word delimiter
map <- function(k,lines) {
  words.list <- strsplit(lines, '\\s')
  words <- unlist(words.list)
  return( keyval(words, 1) )
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function (input, output=NULL) {
  mapreduce(input=input, output=output, input.format="text", map=map, reduce=reduce)
}

## variables
hdfs.root <- '/user/cloudera'
hdfs.data <- file.path(hdfs.root, 'scenario_1')
hdfs.out <- file.path(hdfs.root, 'out')

## run mapreduce job
##out <- wordcount(hdfs.data, hdfs.out)
system.time(out <- wordcount(hdfs.data, hdfs.out))

Error:

> system.time(out <- wordcount(hdfs.data, hdfs.out))
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.3.0-cdh5.1.0.jar] /tmp/streamjob8497498354509963133.jar tmpDir=null
14/09/17 01:49:38 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
14/09/17 01:49:38 INFO client.RMProxy: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
14/09/17 01:49:39 INFO mapred.FileInputFormat: Total input paths to process : 1
14/09/17 01:49:39 INFO mapreduce.JobSubmitter: number of splits:2
14/09/17 01:49:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1410940439997_0001
14/09/17 01:49:40 INFO impl.YarnClientImpl: Submitted application application_1410940439997_0001
14/09/17 01:49:40 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1410940439997_0001/
14/09/17 01:49:40 INFO mapreduce.Job: Running job: job_1410940439997_0001
14/09/17 01:49:54 INFO mapreduce.Job: Job job_1410940439997_0001 running in uber mode : false
14/09/17 01:49:54 INFO mapreduce.Job:  map 100% reduce 100%
14/09/17 01:49:55 INFO mapreduce.Job: Job job_1410940439997_0001 failed with state KILLED due to: MAP capability required is more than the supported max container capability in the cluster. Killing the Job. mapResourceReqt: 4096 maxContainerCapability:1024
Job received Kill while in RUNNING state.
REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceReqt: 4096 maxContainerCapability:1024

14/09/17 01:49:55 INFO mapreduce.Job: Counters: 2
    Job Counters 
        Total time spent by all maps in occupied slots (ms)=0
        Total time spent by all reduces in occupied slots (ms)=0
14/09/17 01:49:55 ERROR streaming.StreamJob: Job not Successful!
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce, : hadoop streaming failed with error code 1
Timing stopped at: 3.681 0.695 20.43 

It seems the issue is in reduceResourceReqt: 4096 maxContainerCapability:1024. I have tried changing yarn-site.xml, but it didn't help. :(

Please help...

yottalab

1 Answer

I have not used RHadoop. However I've had a very similar problem on my cluster, and this problem seems to be linked only to MapReduce.

The maxContainerCapability in this log refers to the yarn.scheduler.maximum-allocation-mb property of your yarn-site.xml configuration. It is the maximum amount of memory that can be allocated to any single container.

The mapResourceReqt and reduceResourceReqt in your log refer to the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties of your mapred-site.xml configuration. They are the memory sizes of the containers that will be created for a Mapper or a Reducer in MapReduce.

If the container size requested for a Mapper or a Reducer is greater than yarn.scheduler.maximum-allocation-mb, which is the case here for both (4096 requested against a maximum of 1024), your job will be killed because YARN is not allowed to allocate that much memory to a container.

Check your configuration at http://[your-resource-manager]:8088/conf; you should find these values there and be able to confirm that this is the case.
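To make that comparison less error-prone, here is a minimal sketch that reads a Hadoop configuration dump and reports which task memory requests exceed the container maximum. It assumes the /conf page returns the usual <configuration>/<property>/<name>/<value> XML layout; the function names and the sample document are hypothetical, and fetching the page from your ResourceManager is left out on purpose.

```python
import xml.etree.ElementTree as ET

# Property names taken from the explanation above.
MAX_ALLOC = "yarn.scheduler.maximum-allocation-mb"
TASK_MEMORY = ("mapreduce.map.memory.mb", "mapreduce.reduce.memory.mb")


def parse_hadoop_conf(xml_text):
    """Return {name: value} for every <property> in a Hadoop configuration XML."""
    root = ET.fromstring(xml_text)
    conf = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            conf[name.strip()] = value.strip()
    return conf


def oversized_requests(conf):
    """Return {property: requested_mb} for task sizes above the container maximum."""
    max_mb = int(conf[MAX_ALLOC])
    return {
        key: int(conf[key])
        for key in TASK_MEMORY
        if key in conf and int(conf[key]) > max_mb
    }


# Hypothetical sample that mirrors the numbers in the failing log (4096 vs 1024).
SAMPLE_CONF = """<configuration>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>1024</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>4096</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>4096</value></property>
</configuration>"""

if __name__ == "__main__":
    # Lists every task memory setting that YARN would refuse to allocate.
    print(oversized_requests(parse_hadoop_conf(SAMPLE_CONF)))
```

To use it against a real cluster, save the output of the /conf page to a file and pass its contents to parse_hadoop_conf. An empty result means neither request is larger than the maximum.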

Maybe your new environment has these values set to 4096 MB (which is quite big; the default in Hadoop 2.7.1 is 1024).

Solution

You should either lower the mapreduce.[map|reduce].memory.mb values to 1024 or, if you have lots of memory and want huge containers, raise the yarn.scheduler.maximum-allocation-mb value to 4096. Only then will MapReduce be able to create containers.
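As a rough illustration of the two options, the property entries would look something like the fragment below. This is a sketch, not a tested configuration for your sandbox: the entries go inside the existing <configuration> element of the file named in each comment, the values are simply the ones from your log, and I am assuming the services need a restart before the yarn-site.xml change is picked up.

```xml
<!-- Option 1, mapred-site.xml: request containers that fit the current maximum -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>1024</value>
</property>

<!-- Option 2, yarn-site.xml: raise the maximum container size instead -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
</property>
```

Pick one option rather than both; either is enough to make the requested size fit under the maximum.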

I hope this helps.

Nicomak