
I have a 32-core system. When I run a MapReduce job using Hadoop, I never see the java process use more than 150% CPU (according to top), and it usually stays around the 100% mark. It should be closer to 3200%.

Which property do I need to change (and in which file) to enable more workers?

Adam

2 Answers


There could be two issues, which I outline below. I'd also like to point out that this is a very common question and you should look at the previously asked Hadoop questions.


Your mapred.tasktracker.map.tasks.maximum could be set too low in conf/mapred-site.xml. This is the issue if, when you check the JobTracker, you see several pending tasks but only a few running tasks. Each task runs as a single thread, so to keep 32 cores busy you would hypothetically need 32 maximum map slots on that node.
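If that's the case, here's a minimal sketch of the relevant entry for conf/mapred-site.xml (assuming classic MRv1 with a TaskTracker; the value 32 is only an illustration matching your core count, and in practice you'd leave a few slots for reduce tasks):

<property>
  <!-- assumption: one map slot per core; tune down to leave room for reducers -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>32</value>
</property>

The TaskTracker has to be restarted before it will pick up the new slot count.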


Otherwise, likely your data is not being split into enough chunks. Are you running over a small amount of data? It could be that your MapReduce job is running over only a few input splits and thus does not require more mappers. Try running your job over hundreds of MB of data instead and see if you still have the same issue. Hadoop automatically splits your files. The number of blocks a file is split up into is the total size of the file divided by the block size. By default, one map task will be assigned to each block (not each file).
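To make that concrete (assuming the default 64 MB block size): a single 200 MB input file is split into ceil(200 / 64) = 4 blocks, so the job only gets about 4 map tasks for that file, no matter how many map slots are free.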

In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 MB or 128 MB. However, if you are trying to do something tiny, you could set it lower to split up the work more.
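As a hedged sketch of what that could look like in conf/hdfs-site.xml (dfs.block.size is given in bytes, and it only applies to files written after the change; existing files keep their old block size, which is why re-ingesting is suggested in the comments below):

<property>
  <!-- 8 MB in bytes; deliberately small so a modest input produces more splits -->
  <name>dfs.block.size</name>
  <value>8388608</value>
</property>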

You can also manually split your file into 32 chunks.

Donald Miner
  • What do you mean "check the JobTracker"? All I can find about it is that it's a class, nothing about how to use it to actually check up on a job. – Adam Oct 10 '11 at 20:15
  • I had tried the `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum` settings with no effect. I set the blocksize to 8 MB and again no effect (my datafiles are about 200MB). Any other ideas? Are there some administrative utilities I can use to at least debug the issue? – Adam Oct 10 '11 at 20:17
  • Each service in Hadoop (NameNode, JobTracker, TaskTracker, DataNode) all have web interfaces. It doesn't seem to be well documented, but there is a bit of it here: http://hadoop.apache.org/common/docs/current/single_node_setup.html#Execution – Donald Miner Oct 10 '11 at 21:57
  • I don't think Hadoop will automatically re-split your files. Try re-ingesting the files, or copying them, or something. – Donald Miner Oct 10 '11 at 21:59
  • The thing you're most right about is the not well documented part. Despite following all the docs I could find, I wasn't starting Hadoop correctly. I'm trying to run someone else's code, and their docs don't say anything about setup either. – Adam Oct 10 '11 at 22:33
  • in other words, I wasn't starting the NameNode, the JobTracker, the TaskTracker, or the DataNode. Their job used `hadoop` directly, not any of the daemons. Apparently that didn't read from the config files at all. – Adam Oct 10 '11 at 22:34

I think you need to set "mapreduce.framework.name" to "yarn", because the default value is "local", which runs the whole job inside a single local JVM.

Put the following into your mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
iaalm