
I searched Google for information on how to tune the value of the DataNode maximum Java heap size, but found only these articles:

https://community.hortonworks.com/articles/74076/datanode-high-heap-size-alert.html

https://docs.oracle.com/cd/E19900-01/819-4742/abeik/index.html

Neither one gives a formula for calculating the right value for the DataNode maximum Java heap size.

The default value for the DataNode maximum Java heap size is 1 GB.

We increased this value to 5 GB, because in some cases we saw errors about heap size in the DataNode logs, but this isn't the right way to tune the value.

So, any suggestion or good article on how to set the right value for the DataNode maximum Java heap size?

Let's say we have the following Hadoop cluster:

  1. 10 DataNode machines, each with 5 disks of 1 TB

  2. Each DataNode has 32 CPUs

  3. Each DataNode has 256 GB of memory

Based on this info, can we find a formula that gives the right value for the DataNode maximum Java heap size?

Regarding Hortonworks: they advise setting the DataNode Java heap to 4 GB, but I am not sure this covers all scenarios.

ROOT CAUSE: DN operations are IO-expensive and do not require 16 GB of heap.

https://community.hortonworks.com/articles/74076/datanode-high-heap-size-alert.html

RESOLUTION: Tuning GC parameters resolved the issue.
4 GB heap recommendation:
-Xms4096m -Xmx4096m -XX:NewSize=800m
-XX:MaxNewSize=800m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSInitiatingOccupancyFraction=70
-XX:ParallelGCThreads=8
Judy

2 Answers


In hadoop-env.sh (also exposed as a field in Ambari; just try searching for "heap"), there's an option for setting the value. It might be called HADOOP_DATANODE_OPTS in the shell file.
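A minimal sketch of that, assuming Hadoop 2.x (where the DataNode reads HADOOP_DATANODE_OPTS from hadoop-env.sh; in Hadoop 3.x the variable is HDFS_DATANODE_OPTS) and reusing the 4 GB figure from the article quoted in the question:

# In $HADOOP_CONF_DIR/hadoop-env.sh
# Setting -Xms equal to -Xmx pre-allocates the heap and avoids resize pauses.
export HADOOP_DATANODE_OPTS="-Xms4096m -Xmx4096m ${HADOOP_DATANODE_OPTS}"

Restart the DataNode afterwards for the new heap settings to take effect.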

8 GB is generally a good value for most servers. You have enough memory, though, so I would start there and actively monitor the usage via JMX metrics, in Grafana for example.
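For a quick spot check before wiring up dashboards, every Hadoop daemon exposes its JVM metrics over HTTP; a sketch, assuming the default DataNode web port (50075 on Hadoop 2.x; 9864 on Hadoop 3.x) and a placeholder hostname:

# Ask the DataNode's JMX servlet for current heap usage.
curl -s 'http://datanode-host:50075/jmx?qry=java.lang:type=Memory'
# The JSON response includes HeapMemoryUsage with used/committed/max bytes.

Watching how "used" behaves under load tells you whether the configured -Xmx is oversized or genuinely needed.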

The NameNode might need adjusting as well: https://community.hortonworks.com/articles/43838/scaling-the-hdfs-namenode-part-1.html

OneCricketeer
  • See my update in the question: Hortonworks suggests limiting the size to 4G and reconfiguring the other values (-XX:NewSize=800m, -XX:MaxNewSize=800m, -XX:+UseParNewGC, etc.), so it seems they are not happy with a high value for the DataNode Java heap size. What is your opinion? – Judy Dec 06 '18 at 15:50
  • I'm not sure what the DataNode really needs heap for, because it just holds data on disk; it doesn't keep it in memory or process it actively – OneCricketeer Dec 06 '18 at 15:55
  • I would maybe try G1GC if you are running into memory issues, but a true resolution would require heap dump analysis over time (see the sketch after these comments) – OneCricketeer Dec 06 '18 at 15:56
  • Yes, you're right, and that's why I don't understand why you're happy with 8G; I think that's a high value – Judy Dec 06 '18 at 15:56
  • So do you think it would be good to take the configuration from Hortonworks? – Judy Dec 06 '18 at 15:59
  • HADOOP_DATANODE_OPTS="-Xms4096m -Xmx4096m -XX:NewSize=800m -XX:MaxNewSize=800m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=8" – Judy Dec 06 '18 at 16:00
  • It's their platform, and they deal with these issues all the time, so I'm not sure why you're not trusting them. I don't think you need 8; I'm just saying you have enough memory to support it. Monitoring the heap usage will tell you whether you need all of it – OneCricketeer Dec 06 '18 at 16:00
  • Do you have advice on how to monitor the heap usage? (Is it a CLI, an API, or something else?) – Judy Dec 06 '18 at 16:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/184829/discussion-between-judy-and-cricket-007). – Judy Dec 06 '18 at 17:02
  • Like I said, Grafana is a good option. For example: https://risdenk.github.io/2018/03/12/improving-hortonworks-hdp-monitoring.html – OneCricketeer Dec 06 '18 at 17:26
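Following up on the G1GC and heap-dump points in the comments, a hedged sketch (the G1 flags are an illustrative alternative, not the Hortonworks recommendation; <pid> is a placeholder for the DataNode process id):

# Alternative GC: let G1 manage the DataNode heap instead of ParNew/CMS.
export HADOOP_DATANODE_OPTS="-Xms4096m -Xmx4096m -XX:+UseG1GC ${HADOOP_DATANODE_OPTS}"

# Capture a heap dump from the running DataNode for offline analysis
# (e.g., with Eclipse MAT) to see what is actually filling the heap.
jmap -dump:live,format=b,file=/tmp/datanode-heap.hprof <pid>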

The recommendation is to keep 1 GB of heap per million data blocks.
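As a back-of-the-envelope check against the cluster described in the question, a sketch assuming the default 128 MB HDFS block size and fully packed blocks (many small files would push the count far higher):

# Estimate the block count one DataNode can hold: 5 disks x 1 TB, 128 MB blocks.
DISKS_PER_NODE=5
TB_PER_DISK=1
BLOCK_MB=128
BLOCKS_PER_NODE=$(( DISKS_PER_NODE * TB_PER_DISK * 1024 * 1024 / BLOCK_MB ))
echo "${BLOCKS_PER_NODE} blocks per node"   # prints 40960, i.e. ~0.04 million

By this rule of thumb, ~41,000 blocks need only a small fraction of 1 GB, so the default heap would already cover this cluster; it is usually small files inflating the block count (or GC behavior) that drives DataNode heap requirements higher.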

Sumit Khurana