10

I've searched and haven't found much information about Hadoop DataNode processes dying due to "GC overhead limit exceeded", so I thought I'd post a question.

We are running a test where we need to confirm that our Hadoop cluster can handle having ~3 million files stored on it (currently a 4-node cluster). We are using a 64-bit JVM and we've allocated 8 GB to the NameNode. However, as my test program writes more files to DFS, the DataNodes start dying off with this error:

Exception in thread "DataNode: [/var/hadoop/data/hadoop/data]" java.lang.OutOfMemoryError: GC overhead limit exceeded

I saw some posts about options (parallel GC?) that can apparently be set in hadoop-env.sh, but I'm not sure of the syntax and I'm something of a newbie, so I didn't quite grok how it's done. Thanks for any help here!

hatrickpatrick
  • Just an update here for folks: at 1.5 million files in DFS, when my 64-bit JVM was at 1g (the default), the DataNodes started dying with this error. When I upped it to 2g, the error went away until I got to about 3 million files. I'm wondering whether this kind of memory bloat is a known problem and, if so, what other recommendations I can try to fix it. – hatrickpatrick Apr 11 '12 at 20:41
  • As Tejas Patil mentioned, the default block size is 64 MB. Hadoop loads metadata for each file into memory, so the more files you have, the more memory it takes up. If these files are much smaller than the default block size and you have the option to do so, try combining them into bigger files before storing them in HDFS (see the sketch just below). Just a thought :) – sufinawaz Aug 22 '13 at 21:55
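
One common way to do that consolidation is a Hadoop Archive; this is only a minimal sketch, and the paths and archive name here are made up:

# Pack everything under /user/test/smallfiles into a single HAR archive written to /user/test/archived
hadoop archive -archiveName files.har -p /user/test/smallfiles /user/test/archived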

4 Answers

11

Try increasing the memory for the DataNode by using this (a Hadoop restart is required for it to take effect):

export HADOOP_DATANODE_OPTS="-Xmx10g"

This will set the heap to 10 GB; you can increase it as needed.

You can also put this line near the top of the $HADOOP_CONF_DIR/hadoop-env.sh file.
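
A minimal sketch of how that line might look inside hadoop-env.sh, appending to the existing variable the way the stock file does (the 10g value is just the one from this answer; size it to your hardware):

# In $HADOOP_CONF_DIR/hadoop-env.sh -- heap for the DataNode daemon only;
# the NameNode and other daemons have their own *_OPTS variables.
export HADOOP_DATANODE_OPTS="-Xmx10g $HADOOP_DATANODE_OPTS"
# Restart HDFS afterwards (e.g. stop-dfs.sh / start-dfs.sh) for the change to take effect.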

Tejas Patil
  • This basically solved it, but I've also learned that when you are storing a lot of files on a small cluster, DataNode memory usage climbs fast because there are limited places replication can occur. If we add nodes, DataNode memory shouldn't climb as quickly (so I hear!). – hatrickpatrick Jun 12 '12 at 20:48
  • @hatrickpatrick HDFS uses 64 MB blocks for file storage; if files are small, a lot of memory is wasted and the NameNode also has to keep track of all of them. Having a few massive files is better than having many small files. – Tejas Patil Jun 16 '12 at 08:05
  • What's the default `-Xmx` for `HADOOP_DATANODE_OPTS`? – Sida Zhou Aug 27 '20 at 08:13
  • The default is 200m – Sida Zhou Aug 27 '20 at 08:30

0

If you are running a MapReduce job from the command line, you can increase the heap using the parameters -D 'mapreduce.map.java.opts=-Xmx1024m' and/or -D 'mapreduce.reduce.java.opts=-Xmx1024m'. Example:

hadoop --config /etc/hadoop/conf jar /usr/lib/hbase-solr/tools/hbase-indexer-mr-*-job.jar --conf /etc/hbase/conf/hbase-site.xml -D 'mapreduce.map.java.opts=-Xmx1024m' --hbase-indexer-file $HOME/morphline-hbase-mapper.xml --zk-host 127.0.0.1/solr --collection hbase-collection1 --go-live --log4j /home/cloudera/morphlines/log4j.properties

Note that some Cloudera documentation still uses the old parameters mapred.child.java.opts, mapred.map.child.java.opts and mapred.reduce.child.java.opts. These parameters no longer work in Hadoop 2 (see "What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?").
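
For a plainer, hedged sketch without the hbase-indexer specifics (the jar name, driver class, and paths below are placeholders; the -D generic options are only parsed this way if the job's driver goes through ToolRunner/GenericOptionsParser):

hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.map.java.opts=-Xmx1024m \
    -D mapreduce.reduce.java.opts=-Xmx1024m \
    /input/path /output/path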

stefan.m

0

This post solved the issue for me.

So the key is to "prepend that environment variable" to the command (first time I'd seen this Linux command syntax :) ):

HADOOP_CLIENT_OPTS="-Xmx10g" hadoop jar "your.jar" "source.dir" "target.dir"
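
In plain shell terms, the VAR=value command form sets the variable only for that one command. A roughly equivalent, slightly more verbose sketch (same placeholder jar and paths as above):

# export keeps the setting for the rest of the shell session instead of just one command
export HADOOP_CLIENT_OPTS="-Xmx10g"
hadoop jar "your.jar" "source.dir" "target.dir"
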
pheeleeppoo
Khalid Mammadov

-2

"GC overhead limit exceeded" indicates that the JVM is spending nearly all of its time in garbage collection while recovering almost nothing, i.e. your (tiny) heap is effectively full.

This often happens in MapReduce jobs when you process a lot of data. Try this:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XX:-UseGCOverheadLimit</value>
</property>

Also, try the following:

Use combiners; the reducers shouldn't receive any lists longer than a small multiple of the number of maps.

At the same time, you can generate a heap dump from the OOME and analyze it with YourKit or a similar tool.
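
A hedged sketch of how such a heap dump could be enabled for the DataNode using standard HotSpot flags (the dump path is only an example; reuse whichever *_OPTS variable you already set in hadoop-env.sh):

# Standard HotSpot flags: write an .hprof file whenever an OutOfMemoryError is thrown.
export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hadoop/heapdumps $HADOOP_DATANODE_OPTS"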

Alexan
  • @ThomasJungblut +1. mapred.child.java.opts can be used to control the heap of the spawned Hadoop jobs, not of the DataNode. – Tejas Patil Apr 13 '12 at 09:16
  • Okay, I did not check that. But actually his problem is of two types: (1) the DataNode memory limitation and (2) the in-between steps such as sorting. So my point is that we can't blindly increase the DataNode heap size to 10 GB, 20 GB and so on; if we can tune it with parameters (as specified above) and use combiners, I think that would be a good solution. – shiva kumar s Apr 13 '12 at 18:58