We have a couple of SGE clusters at my work running various versions of RHEL, and we're testing a new one running a newer Red Hat release. On the old cluster ("CentOS release 5.4"), I'm able to submit a job like the following one and it runs fine:
echo "java -Xms8G -Xmx8G -jar blah.jar ..." |qsub ... -l h_vmem=10G,virtual_free=10G ...
On the new cluster ("CentOS release 6.2 (Final)"), a job with those parameters fails due to running out of memory, and I have to change the h_vmem to h_vmem=17G in order for it to succeed. The new nodes have about 3x the RAM of the old nodes, and during testing I'm only submitting a couple of jobs at a time.
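Concretely, the only change that makes the job succeed on the new cluster is bumping h_vmem (I'm showing virtual_free unchanged here, since h_vmem was the only value I had to touch):

echo "java -Xms8G -Xmx8G -jar blah.jar ..." |qsub ... -l h_vmem=17G,virtual_free=10G ...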
On the old cluster, if I set -Xms/-Xmx to N, I could get away with roughly N+1 for h_vmem. On the new cluster, jobs crash unless I set h_vmem to roughly 2N+1 (e.g., 17G for the 8G heap above).
I wrote a tiny Perl script whose only job is to progressively consume more memory and periodically print the amount used, until it crashes or reaches a limit. With that script, the h_vmem parameter makes it crash at the expected memory usage, so the limit itself seems to be enforced as I'd expect outside of Java.
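I don't have the exact script in this post, but a minimal sketch of the same idea looks like the following (the chunk size, the cap, and reading VmSize out of /proc/self/status are choices for this sketch, not necessarily what my original script does):

#!/usr/bin/env perl
# Progressively consume memory, reporting usage at each step, until killed or a cap is hit.
use strict;
use warnings;

$| = 1;                      # unbuffered output, so progress is visible before a crash
my $chunk_mb = 100;          # grow by roughly 100 MB per step
my $max_mb   = 20 * 1024;    # give up at ~20 GB if nothing kills us first
my @hog;

for (my $used = $chunk_mb; $used <= $max_mb; $used += $chunk_mb) {
    push @hog, 'x' x ($chunk_mb * 1024 * 1024);   # keep the data referenced so it isn't freed
    printf "allocated ~%d MB; %s", $used, vmsize();
    sleep 1;
}

# Report the process's virtual size from /proc/self/status (Linux-specific).
sub vmsize {
    open my $fh, '<', '/proc/self/status' or return "VmSize: unknown\n";
    my ($line) = grep { /^VmSize:/ } <$fh>;
    return defined $line ? $line : "VmSize: unknown\n";
}

Submitted the same way as the Java jobs, it dies right around whatever h_vmem I request, which is what I'd expect.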
I've tried multiple versions of the JVM (1.6 and 1.7). If I omit h_vmem, the job works, but then things are riskier to run.
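For reference, my understanding is that SGE applies h_vmem as a virtual-memory ulimit on the job's processes (so this may not be the right check if that assumption is wrong); one way to see what limit a job actually ends up with would be to submit something like:

echo "bash -c 'ulimit -v'" |qsub ... -l h_vmem=10G ...

and read the reported value (in kB) from the job's output file.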
I've googled and found others reporting similar issues, but no resolutions.