
I'm having some difficulties with my scripts. The purpose is to launch one or several OpenVZ containers to execute some tests. Those tests can be very long (about 3 hours, usually).

The first script goes this way: after sorting the queue members to launch, it does:

subprocess.Popen(QUEUE_EXECUTER % queue['queue_id'], shell=True)

Where "QUEUE_EXECUTER % queue['queue_id']" is the complete command to run. The queue_executer script then goes this way:

import subprocess

# Launching install
cmd = queue['cmd_install']
report_install = open(queue['report_install'], 'a')
process_install = subprocess.Popen(cmd, shell=True, stdout=report_install, stderr=subprocess.STDOUT)
process_install.wait()

# Launching test
logger.debug('Launching test')
report_test = open(queue['report_test'], 'a')
cmd = queue['cmd_test']
process_test = subprocess.Popen(cmd, shell=True, stdout=report_test, stderr=subprocess.STDOUT)
process_test.wait()

It works quite fine, but sometimes, and more recently most of the time, the execution stops. There is no error in the logs or anything. The report file shows that it stopped right in the middle of writing a line (which, I believe, is because the file isn't correctly closed on the Python side). On the host side the OOM killer doesn't seem to do anything, and I've searched through the host's logs without finding anything either.

The two "cmd" values launched above are shell scripts which basically set a VZ up and execute a test program on it.

So my big question is: am I missing something which would cause the scripts to stop on the Python side?

Thanks.

EDIT: Some complementary information.

The command which fails is always the second one. Here are two example values of the commands I try to execute: /path/vzspawncluster.sh /tmp/file web --tarball /services/pkgs/etch/releases/archive.tar.gz --create and /path/vzlaunch.sh 172 -b trunk --args "-a -v -s --time --cluster --sql=qa3 --queue=223 --html --mail=adress@mail.com"

The vzlaunch script launches a Python script on an OpenVZ container with vzctl enter ID /path/script.py, where ID is the container ID and /path/script.py is the script on the container.

The report_install and report_test files are located on a different machine, accessed through an NFS share. That should not matter, but as I really don't know what's going on when it fails, I note it anyway.

When it fails, the process on the container dies. It does not remain in any state of zombieness or anything; it's just dead. Although the process on the container fails, the main process (the one that launches them all) continues as if everything was fine.

Some more info: I tried the buffer-flushing approach pointed out by smci, but the writing of my log file keeps being cut right in the middle of a line:

[18:55:27][Scripteo]       Create process '/QA/valideo.trunk/tests/756/test.py -i 10.1.11.122 --report --verbose --name 756 --...
[18:56:35][Scripteo]       Create process '/QA/valideo.trunk/tests/762/test.py -i 10.1.11.122 --report --verbose --name 762 --...
[18:57:56][Scripteo]       Create process '/QA/valideo.trunk/tests/764/test.py -i 10.1.11.122 --report --verbose --name 764 --...
[18:59:27][Scripteo]       Create process '/QA/valideo.trunk/tests/789/test.py -i 10.1.11.122 --report --verbose --name 789 --...
[19:00:44][Scripteo]       Create process '/QA/valideo.trunk/tests/866/test.py -i 10.1.11.122 --report --verbose --name 866 --...
[19:02:27][Scripteo]       Create process '/QA/valideo.trunk/tests/867/test.py -i 10.1.11.122 --report --verbose --name 867 --...
[19:04:13][Scripteo]       Create process '/QA/valideo.trunk/tests/874/t
  • What do you mean by _'The machine report_install and report_test are files situated on a different machine accessed through a NFS share.'_ Do you mean the _files_ are on different machines, or the _jobs_, or both? Do you mean both jobs are running in parallel on different machines? Why not run both jobs on the known-good machine? or try running the failing one first? I'm not clear whether these two containers are interdependent or not. – smci Jul 20 '11 at 09:35
  • 1
    My script isn't the only one which produces this type of report; that's why they were centralized on a distant machine through an NFS share. (Only the files were located there; the job was running on the local machine, writing through the network to a distant machine.) I moved all the logs to a local directory and it doesn't seem to crash anymore. I have the feeling, though, that I only cured the symptoms and the problem is still there. Anyway, thanks for your help! – jaes Aug 04 '11 at 07:18
  • Ok, I encourage you to file a bug with OpenVZ on this. – smci Aug 04 '11 at 10:06

1 Answer


Your intent is first to run process_install until it finishes, then run process_test? (Sequentially, not multiprocessing, right?) Which command do you suspect of timing out?

Please paste the actual values of queue['cmd_install'] and queue['cmd_test'].

(Does either of those commands have a trailing '&' or redirects?)

Here are my debugging suggestions:

  • (I don't know OpenVZ, but I assume you've checked its logs and whether it allows running at-exit commands.)

  • Are you running on UNIX? If so, you could play around with the commands: run cmd in the background and also run a loop that generates output, e.g. a while(1) loop that touches a sentinel file and then sleeps 10s. Or you could run cmd; touch donesentinel.

  • Try adding a polling loop that calls poll() on each Popen object at a regular interval, instead of calling wait().

  • Alternatively, print Popen.pid after the process launches, then externally check or poll whether that process is still alive (e.g. with UNIX top -p).

  • If your process generates a lot of output, did you note the caveat on Popen.wait()? "Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that."

  • If you suspect that is happening, redirect either or both of stdout and stderr to os.devnull and see whether your results differ. Or see this buffer-flushing approach.
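The polling suggestion could look something like this. A minimal sketch only; the run_logged name, the 10-second interval, and the flush-on-every-tick detail are my own choices, not from your code:

```python
import subprocess
import time

def run_logged(cmd, report_path, poll_interval=10):
    """Run `cmd` under a shell, appending its output to report_path.

    Instead of blocking in wait(), poll the child at a regular interval
    and flush the report file on every tick, so partial output reaches
    the disk (or the NFS server) even if something dies mid-line.
    """
    with open(report_path, 'a') as report:
        proc = subprocess.Popen(cmd, shell=True,
                                stdout=report,
                                stderr=subprocess.STDOUT)
        while proc.poll() is None:    # poll() returns None while the child runs
            time.sleep(poll_interval)
            report.flush()            # push Python's buffer out to the OS
    return proc.returncode            # the with-block guarantees the file closes
```

The with-block also fixes the question's unclosed report files, which matters over NFS since buffered data may otherwise never reach the remote side.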

  • Hi, and thanks for the answer. I will add information to my main post, but here I can already say I searched through all kinds of logs, on the container and on the host, and found nothing. I looked into every way the launched scripts could leave a file descriptor open or fill the buffer in a way that would cause the script to stop, but again, found nothing. On the advice of a coworker, I tried executing the scripts with "python -u", which forbids Python to buffer anything, but that changed nothing either. I'll try to change the way I wait for the script to end as you advise. – jaes Jul 18 '11 at 08:33