I am using GNU parallel to launch code on a high-performance computing (HPC) cluster that has 2 CPUs per node. The cluster uses the TORQUE portable batch system (PBS). My question is how the --jobs option for GNU parallel works in this scenario.
When I run a PBS script calling GNU parallel without the --jobs option, like this:
#PBS -lnodes=2:ppn=2
...
parallel --env $PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
matlab -nodisplay -r "\"cd $PBS_O_WORKDIR,primes1({})\"" ::: 10 20 30 40
it looks like it only uses one CPU per node, and it also produces the following error stream:
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles087 (). Using 1.
bash: parallel: command not found
parallel: Warning: Could not figure out number of cpus on galles108 (). Using 1.
This looks like one error for each node. I don't understand the first part ("bash: parallel: command not found"), but the second part tells me it's using one CPU on each node.
When I add the option -j2 to the parallel call, the errors go away, and I think that it's using two CPUs per node. I am still a newbie to HPC, so my way of checking this is to output date-time stamps from my code (the dummy MATLAB code takes tens of seconds to complete). My questions are:
- Am I using the --jobs option correctly? Is it correct to specify -j2 because I have 2 CPUs per node? Or should I be using -jN, where N is the total number of CPUs (number of nodes multiplied by number of CPUs per node)?
- It appears that GNU parallel attempts to determine the number of CPUs per node on its own. Is there a way that I can make this work properly?
- Is there any meaning to the "bash: parallel: command not found" message?
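
As a cross-check on the time-stamp approach described above, the per-node CPU count can also be read from the node file itself. This is only a minimal sketch: it assumes TORQUE writes each hostname into $PBS_NODEFILE once per allocated core, the function name slots_per_node is hypothetical, and the temporary file stands in for the real node file so the snippet can be run outside a job.

```shell
#!/bin/bash
# Hypothetical helper: print "<hostname> <slots>" for each node listed in a
# TORQUE-style node file (assumed: one hostname line per allocated core).
slots_per_node() {
    sort "$1" | uniq -c | awk '{print $2, $1}'
}

# Stand-in for $PBS_NODEFILE under "#PBS -lnodes=2:ppn=2"
# (hostnames taken from the warnings quoted above).
nodefile=$(mktemp)
printf '%s\n' galles087 galles087 galles108 galles108 > "$nodefile"

# One line per node, with the number of slots allocated on it.
slots_per_node "$nodefile"
```

Inside a real job the same function would be called as slots_per_node "$PBS_NODEFILE".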