When you submit a job, pbs_server passes your task to pbs_mom. The pbs_mom process/daemon actually executes your script on the execution node. It "creates a new session as identical user." This means invoking a shell. (You specify the shell at the top of the script with a shebang: #!/bin/bash).
Clearly, pbs_mom must store the PID of that process (the shell) somewhere, both to kill the job and to detect when the job (the shell process) has finished.
UPD, based on @Dmitri Chubarov's comment: pbs_mom stores the subshell PID internally in memory after calling fork(), and also in the .TK file in the jobs directory under the torque installation: /var/spool/torque/mom_priv/jobs on my system.
Dumping the file contents in decimal mode (<job_number> and <queue_name> should be your own values):
$ hexdump -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK
revealed that in my torque implementation the PID is stored at position 00000890 + offset 4*2 = 00000898 (the hex address of the first byte of the PID in the .TK file, which is 2200 in decimal) and has a length of 2 bytes.
For example, for shell PID=27110 I have:
0000890 00001 00000 00001 00000 27110 00000 00000 00000
Let's recover the PID from the .TK file:
$ hexdump -s 2200 -n 2 -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK | tr -s ' ' | cut -s -d' ' -f 2
27110
This way you've found the subshell PID.
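The same read can be wrapped in a small helper. What follows is only a minimal sketch and not part of torque: read_tk_pid is a hypothetical name, and the default offset of 2200 bytes (0x890 + 4*2) and the 2-byte width are simply what was observed above on one installation, so check them against a hexdump of your own .TK file first. It uses od rather than hexdump; like hexdump -d, od reads the value in the host byte order. Keep in mind that two bytes cannot hold a PID above 65535.

```shell
# Hypothetical helper: print the PID stored in a torque .TK file.
#   $1  path to the .TK file
#   $2  byte offset of the PID (assumption: 2200, as observed on one system)
read_tk_pid() {
    tk_file=$1
    offset=${2:-$((0x890 + 4 * 2))}   # 0x898, i.e. 2200 in decimal
    # -An: no address column; -tu2: unsigned 2-byte integers;
    # -j: skip to the offset; -N 2: read exactly two bytes.
    od -An -tu2 -j "$offset" -N 2 "$tk_file" | tr -d '[:space:]'
    echo
}
```

Usage would look like: read_tk_pid /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK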
Now monitor the process list on the execution node and find the names of the child processes (the getcpid function is a slightly modified version of one posted earlier on SO):
getcpid() {
    local cpid
    # List the direct children of PID $1, print each command name, then recurse.
    for cpid in $(pgrep -P "$1"); do
        ps -p "$cpid" -o comm=
        getcpid "$cpid"
    done
}
Finally,
getcpid <your_PID>
gives you the names of the child processes (note that there will be some garbage lines, such as task numbers). This way you will know which command is currently running on the execution node.
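If the bare names are hard to match to processes, a variant can print each PID next to its command name. This is a sketch under assumptions: list_descendants is a hypothetical name, and it relies on the pgrep and ps tools from procps being installed on the execution node.

```shell
# Hypothetical variant of getcpid: print "PID COMMAND" for every descendant of $1.
list_descendants() {
    local child
    for child in $(pgrep -P "$1"); do
        # pid= and comm= suppress the header line for each column.
        ps -p "$child" -o pid= -o comm=
        list_descendants "$child"
    done
}
```

It is called the same way as getcpid: list_descendants <your_PID>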
Of course, for each task monitored, you should obtain the PID and process name on the execution node after doing
ssh <your node>
You can automatically retrieve the node name(s) in <node/proc+node/proc+...> format (process it further to obtain bare node names):
qstat -n <job number> | awk '{print $NF}' | grep <pattern_for_your_node_names>
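The further processing mentioned above could look like the following sketch. bare_node_names is a hypothetical helper name; it assumes the input really is in the node/proc+node/proc+... form, one such string per line on stdin.

```shell
# Hypothetical helper: turn "node/proc+node/proc+..." read from stdin
# into a list of unique node names, one per line.
bare_node_names() {
    # Split on '+', keep the part before '/', drop duplicates.
    tr '+' '\n' | cut -d/ -f1 | sort -u
}
```

It can be appended to the pipeline: qstat -n <job number> | awk '{print $NF}' | grep <pattern_for_your_node_names> | bare_node_names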
Note:
The PID method is reliable and, I believe, optimal.
Searching by name is worse: it gives an unambiguous result only if you invoke different commands in your scripts and no other user runs the same software on the node.
ssh <your node>
ps aux | grep matlab
This tells you whether matlab is running (ignore the line for the grep command itself).