
I am trying to write a script that provides diagnostics on processes. I have submitted a script to a job scheduling server using qsub, and I can easily find the node the job was sent to. However, I would like to find out which process is currently running. That is, given a list of different commands in the submitted script, how can I find which one is currently running, and the arguments passed to it?

Example of commands in the script:

matlab -nodesktop -nosplash -r "display('here'),quit"
python runsomethings.py

I would like to see whether the node is currently executing the first or the second line.

2 Answers

When you submit a job, pbs_server passes your task to pbs_mom. The pbs_mom process (daemon) is what actually executes your script on the execution node. It

"creates a new session as identical user."

This means invoking a shell. You specify which shell at the top of the script with a shebang (for example, #!/bin/bash).

Clearly, pbs_mom must store the PID of that shell process somewhere, both to kill the job and to detect when the job (the shell process) has finished.


Update, based on @Dmitri Chubarov's comment: pbs_mom keeps the subshell PID in memory after calling fork(), and also writes it to the job's .TK file under the Torque installation directory (/var/spool/torque/mom_priv/jobs on my system).

Dump the file contents in decimal mode (replace <job_number> and <queue_name> with your own values):

$ hexdump -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK

This shows that in my Torque installation the PID is stored at offset 0x890 + 4*2 = 0x898 (hexdump prints offsets in hex; this is byte 2200 in decimal, the first byte of the PID in the .TK file) and is 2 bytes long. For example, for shell PID=27110 I have:

0000890   00001   00000   00001   00000   27110   00000   00000   00000

Now extract the PID from the .TK file:

$ hexdump -s 2200 -n 2 -d /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK | tr -s ' ' | cut -s -d' ' -f 2
27110

This gives you the subshell PID.
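The same extraction can be wrapped in a small function so that a diagnostics script can reuse it. This is only a sketch, not a documented Torque interface: the function name tk_pid is hypothetical, and the default offset (2200) and the 2-byte width are the values observed above on one installation, so they may well differ on yours. Note also that a 2-byte field cannot hold a PID above 65535. The sketch uses od, which reads the value in the host's byte order.

```shell
# tk_pid FILE [OFFSET]
# Print the 2-byte unsigned integer stored at byte OFFSET (default 2200) in FILE.
# Assumption: offset and width were observed on a single Torque installation;
# check them with hexdump on your own system before relying on this.
tk_pid() {
    local file=$1 offset=${2:-2200}
    # -A n: no address column, -t u2: unsigned 2-byte integers,
    # -j: skip to OFFSET, -N 2: read exactly two bytes
    od -A n -t u2 -j "$offset" -N 2 "$file" | tr -d ' \n'
    echo
}
```

On the execution node it would be called as `tk_pid /var/spool/torque/mom_priv/jobs/<job_number>.<queue_name>.TK`.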

Now inspect the process list on the execution node and find the names of the child processes (getcpid is a slightly modified version of a function posted earlier on SO):

getcpid() {
    # Recursively print the command name of every descendant of PID $1.
    local cpid
    for cpid in $(pgrep -P "$1"); do
        ps -p "$cpid" -o comm=
        getcpid "$cpid"
    done
}

Finally,

getcpid <your_PID>

prints the names of the child processes (note that the output will contain some garbage lines, such as task numbers). This tells you which command is currently running on the execution node.
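Since the question also asks for the arguments, a variant of the same walk can print each descendant's full command line rather than just its name. This is a minimal sketch, assuming procps-style pgrep and ps are available on the node; the function name getcargs is hypothetical.

```shell
# getcargs PID
# Recursively print "PID COMMAND ARGS..." for every descendant of PID.
getcargs() {
    local cpid
    for cpid in $(pgrep -P "$1"); do
        # Separate -o options with empty headers suppress the header line.
        ps -p "$cpid" -o pid= -o args=
        getcargs "$cpid"
    done
}
```

Called as `getcargs <your_PID>`, it should list lines such as the matlab or python invocation together with its arguments, depending on which step of the script is active.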


Of course, for each monitored task you need to obtain the PID and the process name on the execution node itself, after logging in with

ssh <your node>

You can retrieve the node name(s) automatically, in <node/proc+node/proc+...> format (process the output further to obtain bare node names):

qstat -n <job number> | awk '{print $NF}' | grep <pattern_for_your_node_names>
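The further processing mentioned above might look like the following sketch. It assumes the field really has the node/proc+node/proc form described here; the function name bare_nodes is hypothetical.

```shell
# bare_nodes: read "node/proc+node/proc+..." strings on stdin and print
# each distinct node name once, one per line.
bare_nodes() {
    # Split on '+', strip everything from the first '/', drop empty lines.
    tr '+' '\n' | sed -e 's|/.*$||' -e '/^$/d' | sort -u
}
```

It can be appended to the pipeline above, for example `... | grep <pattern_for_your_node_names> | bare_nodes`.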

Note: The PID method is reliable and, I believe, optimal. Searching by name is worse: it gives an unambiguous result only if your script invokes distinct commands and no other user is running the same software on the node.

ssh <your node>
ps aux | grep matlab

This shows whether matlab is currently running.

John_West
    The shell invoked by `pbs_mom` is a child process of `pbs_mom`. `pbs_mom` just calls `fork()` that returns the PID of the shell into the parent process. To survive `pbs_mom` restart, Torque also stores job data that includes the PID in a binary form in a per-job file located in `mom_priv` directory. The `*.TK` file contains a PID that can be used to kill the job. – Dima Chubarov Dec 23 '15 at 06:36

A simple and elegant way to do this is to have the script write to a log file before each step:

ARGS=" $A $B $test "
echo "running MATLAB now with args: $ARGS" >> "$LOGFILE"
matlab -nodesktop -nosplash -r "display('here'),quit"

PYARGS="$X $Y"
echo "running Python now with args: $PYARGS" >> "$LOGFILE"
python runsomethings.py

Then monitor the log with tail -f $LOGFILE.
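To avoid repeating the echo before every command, the logging can be folded into a small wrapper. This is a sketch under the assumption that LOGFILE is set earlier in the job script; the function name run_logged is hypothetical.

```shell
# run_logged CMD [ARGS...]
# Append a timestamped line with the command and its arguments to $LOGFILE,
# then run the command and return its exit status.
run_logged() {
    printf '%s running: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOGFILE"
    "$@"
}
```

In the job script each step then becomes, for example, `run_logged python runsomethings.py`. The logged line joins the arguments with spaces, so arguments that themselves contain spaces are not distinguishable in the log, although they are still passed to the command unchanged.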