4

When submitting jobs with qsub on a StarCluster / SGE cluster, is there an easy way to ensure that each node receives at most one job at a time? Multiple jobs are ending up on the same node, which leads to out-of-memory (OOM) errors.

I tried using -l cpu=8, but I think that checks only the number of cores on the box itself, not the number of cores currently in use.

I also tried -l slots=8 but then I get:

Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
Alex Rothberg

3 Answers

4

In your config file (.starcluster/config) add this section:

[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
slots_per_host = 1
Tobias
1

This largely depends on how the cluster's resources are configured (memory limits, etc.). However, one thing to try is to request a large amount of memory for each job:

-l h_vmem=xxG

This has the side effect of keeping other jobs off the node, because most of the memory on that node has already been requested by the job running there.

Just make sure the memory you request does not exceed the allowable limit for the node. You can see whether the request exceeds this limit by checking the output of qstat -j <jobid> for errors.
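As a rough sketch of how to pick the value: if each job requests more than half of a node's memory, two such jobs cannot fit on the same node. This only works when h_vmem is set up as a consumable resource with a per-host capacity, which is not the SGE default, so treat that as an assumption about your cluster. The helper name vmem_request_gb, the 16 GB node size, and job.sh are all placeholders for illustration.

```shell
#!/bin/sh
# Hypothetical helper: smallest whole-GB request that is strictly more than
# half of the node's memory, so two such requests cannot fit on one node.
vmem_request_gb() {
  node_mem_gb=$1
  echo $(( node_mem_gb / 2 + 1 ))
}

# Assumption: each node has 16 GB; adjust to your instance type.
NODE_MEM_GB=16
REQUEST_GB=$(vmem_request_gb "$NODE_MEM_GB")

# Print the qsub command rather than running it; job.sh is a placeholder.
echo "qsub -l h_vmem=${REQUEST_GB}G job.sh"
```

Keep in mind that the job's real memory use has to stay below the requested value, since a job that exceeds h_vmem is killed.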

Vince
  • Is there no way to be more explicit about one job per node than hoping that the setting in `h_vmem` is enough? This is also a little scary: "If h_vmem is exceeded by a job running in the queue, it is aborted via a SIGKILL signal". It seems like the slots mechanism would be the right thing to use. – Alex Rothberg Sep 04 '14 at 20:11
  • From my limited know-how, there is no *direct* way to limit to 1 slot per node using qsub. The idea is that SGE takes care of load balancing for you. An alternative to the memory trick above is to use a parallel environment and request a bunch of slots per job, thus tricking SGE into thinking that the node is full. Alternatively, and likely the best solution if you have admin access to SGE, you could create another queue with 1 slot allocated per node. Another idea I just had ... you can use the `-l hostname=` option to target a specific host, but that would require some bash scripting. – Vince Sep 05 '14 at 12:32
-1

I accomplished this by setting the number of slots on each of my nodes to 1 using: qconf -aattr queue slots "[nodeXXX=1]" all.q
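To avoid typing that once per node, the command can be generated for a list of hosts. The following is a minimal sketch, not a tested admin script: the function name slot_commands and the host names are made up for illustration, and it only prints the commands so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: print one qconf command per host name given as an
# argument, each setting that host's slot count in all.q to 1.
slot_commands() {
  for host in "$@"; do
    printf 'qconf -aattr queue slots "[%s=1]" all.q\n' "$host"
  done
}

# Dry run with example host names (placeholders, not real nodes).
slot_commands node001 node002
```

On a live cluster the host list could come from qconf -sel, which lists the execution hosts, for example slot_commands $(qconf -sel), with the printed lines piped to sh once they look right.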

Alex Rothberg