I'm running 500 jobs on a Sun Grid Engine cluster, and I'm having some problems: the cluster administrator suspended my jobs because I was using more CPU than I had requested. The code is written in Java.
When I run one of the jobs on my PC (Ubuntu 14.04) and watch it with the htop
command, I see several entries for the same program. These are not separate processes but threads (htop shows userland threads by default; press H to toggle them). My code does not create any threads itself, so they are probably JVM-internal threads (the garbage collector, for example). The first problem is: when I run the same test on the cluster and use htop there, I see many more threads/processes, around 50 for a single job. Does anybody know why this might be happening?
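My guess (an assumption on my part) is that the JVM sizes pools such as the parallel GC workers and JIT compiler threads from the number of CPUs it detects, so a many-core cluster node starts far more threads than my desktop. One way to confirm the count is with ps; the snippet below inspects the current shell (`$$`) purely as a demo, and you would substitute the Java job's PID:

```shell
# Count a process's lightweight processes (threads) via ps's nlwp column.
# $$ (the current shell) is just a placeholder — use your Java job's PID.
nthreads=$(ps -o nlwp= -p $$ | tr -d ' ')
echo "thread count: $nthreads"

# For a JVM you can also list the thread names (jstack ships with the JDK):
#   jstack <pid> | grep '^"'
```

On a JVM most of those threads will have self-explanatory names like "GC task thread" or "C2 CompilerThread".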
I'm using the following options with qsub:
qsub -t 1-500 -l h_rt=05:00:00 -l h_cpu=05:00:00 -l h_vmem=6G -e /some_path/ -o /some_path/ -N all_runs -cwd -m as -M mail@mail ./run.sh
(All the jobs are specified in run.sh.)
With this qsub
command each job gets 1 slot, but CPU usage is sometimes 150–200%, so 1 slot is not enough. I saw that the cluster has a parallel environment, so more slots can be assigned to each job. This can be done by adding -pe smp 4
(or some other number) to the qsub
command.
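For what it's worth, when a job runs under a parallel environment, Grid Engine exports the granted slot count as $NSLOTS, so run.sh could size the JVM's thread pools to match the reservation. A sketch of what that might look like (myapp.jar and the exact flag values are illustrative, not from my actual script):

```shell
#!/bin/sh
# Sketch for run.sh: match JVM thread pools to the slots SGE granted.
# $NSLOTS is set by Grid Engine when the job runs under -pe smp N;
# default to 1 slot if it is unset.
SLOTS=${NSLOTS:-1}

java -XX:ParallelGCThreads="$SLOTS" \
     -XX:CICompilerCount=2 \
     -jar myapp.jar   # myapp.jar is a placeholder for the real program
```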
How do you know how many slots you need? And does -pe smp 4
strictly limit the job to a maximum of 4 slots' worth of CPU? I mean, when a job has 1 slot but uses 200% CPU, it can affect other users' jobs. I want to be sure that cannot happen.
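As far as I understand (this is an assumption, not something I found in the cluster docs), -pe smp 4 only reserves 4 slots for scheduling; it does not by itself cage the process unless the administrators enabled core binding or cgroup enforcement. If I wanted a hard guarantee myself, I could pin the job to specific cores with taskset from util-linux:

```shell
# Pin the command to core 0 only: however many threads the JVM starts,
# total CPU usage cannot exceed 100% of that one core.
taskset -c 0 java -jar myapp.jar   # myapp.jar is a placeholder
```

Newer Grid Engine versions also seem to have a qsub -binding option that does core binding on the scheduler side, but I don't know if this cluster supports it.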
If there is some important information missing please let me know and I'll add it.