GNU parallel --jobs option using multiple nodes on cluster with multiple cpus per node

Question

I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE Portable Batch System (PBS). My question is how the --jobs option for GNU parallel works in this scenario.

When I run a PBS script calling GNU parallel without the --jobs option, like this:

#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

it looks like it only uses one CPU per node, and it also produces the following error stream:

bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.

This looks like one error for each node. I don't understand the first part (bash: parallel: command not found), but the second part tells me it's using one CPU on each node.

When I add the option -j2 to the parallel call, the errors go away, and I think that it's using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:

1. Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
2. It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
3. Is there any meaning to the bash: parallel: command not found message?

score 5 · Accepted Answer · answered Mar 07 '14 at 09:32

1. Yes: -j is the number of jobs per node.
2. Yes: install parallel in your $PATH on the remote hosts.
3. Yes: it is a consequence of parallel missing from the $PATH.

GNU Parallel logs into the remote machine and tries to determine the number of cores (using parallel --number-of-cores). That fails because parallel is not in the $PATH there, so it defaults to 1 CPU core per host. If you give -j2, GNU Parallel will not try to determine the number of cores.

Did you know that you can also give the number of cores in the --sshlogin, as in 4/myserver? This is useful if you have a mix of machines with different numbers of cores.
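As a rough sketch of the cores/host form, the snippet below writes a login file for two machines with different core counts. The hostnames bignode and smallnode and the temporary file are made up for illustration; the parallel call is left as a comment because it needs working ssh and parallel on every host.

```shell
# Hypothetical hosts with different core counts, written as "cores/host".
loginfile=$(mktemp)
cat > "$loginfile" <<'EOF'
4/bignode
2/smallnode
EOF
# On a cluster where ssh and parallel work on every host, the file could be
# passed with --sshloginfile, for example:
#   parallel --sshloginfile "$loginfile" echo {} ::: 10 20 30 40
cat "$loginfile"
```

With a file like this there should be no need for -j to state the per-host core count, since each line already carries it.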

Ole Tange


Sometimes GNU parallel fails to determine the number of cores. However, for systems with /proc/cpuinfo, such as some Debian systems, you can use `LOGICALCPUCOUNT=$(ssh -o PreferredAuthentications=publickey $NODE_IP grep -c "processor" /proc/cpuinfo)` or `PHYSICALCPUCOUNT=$(ssh -o PreferredAuthentications=publickey $NODE_IP grep "cpu\ cores" /proc/cpuinfo 2>/dev/null |sort -u |cut -d":" -f2|awk '{s+=$1} END {print s}')` to retrieve the logical or physical cores, respectively, available at the node in question. – Mr Purple Sep 21 '15 at 21:54
@MrPurple Can you determine on which systems it fails, so I can reproduce it? Maybe it is certain kernel versions? It should be fixed rather than worked around. – Ole Tange Sep 22 '15 at 06:57
Well I used to get that error but now I can't replicate it... Currently running kernel 3.16.0-49-generic and GNU parallel 20130922 on all systems. – Mr Purple Sep 22 '15 at 08:20
@MrPurple If/when you can replicate it, please file a bug report. – Ole Tange Sep 22 '15 at 08:39

score 2 · Answer 2 · answered Nov 06 '15 at 17:36

This is not an answer to the 3 primary questions, but I'd like to point out some other problems with the parallel statement in the first code block.

parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodiplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40

The shell expands $PBS_O_WORKDIR before executing parallel. This has two consequences: (1) --env sees a path rather than an environment variable name and essentially does nothing, and (2) the path is expanded as part of the command string, which removes the need to pass $PBS_O_WORKDIR to the remote side and is why there wasn't an error.
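The expansion can be seen without parallel at all. In this small sketch, show_args is a made-up helper that prints each argument it receives, and /home/user/job is a stand-in value for what TORQUE would put in PBS_O_WORKDIR:

```shell
# Stand-in value; under TORQUE the batch system sets PBS_O_WORKDIR.
PBS_O_WORKDIR=/home/user/job

# Hypothetical helper: print every argument on its own line, in brackets.
show_args() { printf '[%s]\n' "$@"; }

# With the $ prefix, the shell substitutes the value before the command runs,
# so the command receives a path, not a variable name.
expanded=$(show_args --env $PBS_O_WORKDIR)

# Without the $ prefix, the command receives the variable name itself,
# which is what --env is meant to be given.
by_name=$(show_args --env PBS_O_WORKDIR)

echo "$expanded"
echo "$by_name"
```

The first call hands over the directory path, the second hands over the name PBS_O_WORKDIR.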

The latest version of parallel (20151022) has a --workdir option (although the tutorial lists it as alpha testing), which is probably the easiest solution. The parallel command line would look something like:

parallel --workdir $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
  matlab -nodisplay -r "primes1({})" ::: 10 20 30 40

A final note: PBS_NODEFILE may list a host multiple times if more than one processor per node is requested through qsub. This may have implications for the number of jobs run, etc.
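One possible way to deal with repeated hosts, offered as a sketch rather than a tested recipe, is to collapse the node file into count/host lines, matching the cores/host form mentioned in the accepted answer. The node file contents below are invented to mimic nodes=2:ppn=2; on a real job you would read $PBS_NODEFILE instead.

```shell
# Stand-in for $PBS_NODEFILE with nodes=2:ppn=2; the contents are hypothetical.
nodefile=$(mktemp)
printf '%s\n' galles087 galles087 galles108 galles108 > "$nodefile"

# Count how often each host appears and emit one "count/host" line per host.
loginlines=$(sort "$nodefile" | uniq -c | awk '{print $1 "/" $2}')
echo "$loginlines"
```

The resulting lines could be saved to a file and given to --sshloginfile in place of the raw node file.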
