I'm having some difficulties with my scripts. Their purpose is to launch one or several OpenVZ containers to execute some tests. Those tests can be very long (about 3 hours usually).
The first script goes this way: after sorting the queue members to launch, it does:
subprocess.Popen(QUEUE_EXECUTER % queue['queue_id'], shell=True)
Where "QUEUE_EXECUTER % queue['queue_id']" is the complete command to run. In the queue_executer script it goes this way :
# Launching install
cmd = queue['cmd_install']
report_install = open(queue['report_install'], 'a')
process_install = subprocess.Popen(cmd, shell=True, stdout=report_install, stderr=subprocess.STDOUT)
process_install.wait()
# Launching test
logger.debug('Launching test')
report_test = open(queue['report_test'], 'a')
cmd = queue['cmd_test']
process_test = subprocess.Popen(cmd, shell=True, stdout=report_test, stderr=subprocess.STDOUT)
process_test.wait()
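For reference, here is the same logic as a condensed, self-contained sketch (the commands, paths and queue values below are made up), this time closing the report files explicitly and logging the return codes; my actual script does neither:

import subprocess
import logging

logger = logging.getLogger('queue_executer')

# Made-up values, just to make the sketch runnable on its own.
queue = {
    'cmd_install': 'echo install step',
    'cmd_test': 'echo test step',
    'report_install': '/tmp/report_install.log',
    'report_test': '/tmp/report_test.log',
}

def run_step(cmd, report_path):
    # The 'with' block guarantees the report file is flushed and closed,
    # even if something raises while the child is running.
    with open(report_path, 'a') as report:
        process = subprocess.Popen(cmd, shell=True,
                                   stdout=report, stderr=subprocess.STDOUT)
        rc = process.wait()
    logger.debug('%r finished with return code %s', cmd, rc)
    return rc

run_step(queue['cmd_install'], queue['report_install'])
run_step(queue['cmd_test'], queue['report_test'])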
This setup works quite fine, but sometimes (and recently, most of the time) the execution just stops. There is no error in the logs or anything. The report file shows that it stopped right in the middle of writing a line, which I believe is because the file isn't correctly closed on the Python side. On the host side the OOM killer doesn't seem to be doing anything, and I've searched through the host's logs without finding anything either.
The two "cmd" launched above are shell script which basically set a vz up, and execute a test program on it.
So my big question is: am I missing something that would cause the scripts to stop on the Python side?
Thanks.
EDIT: Some complementary information.
The command which fails is always the second one. Here are two example values of the commands I try to execute: /path/vzspawncluster.sh /tmp/file web --tarball /services/pkgs/etch/releases/archive.tar.gz --create
and /path/vzlaunch.sh 172 -b trunk --args "-a -v -s --time --cluster --sql=qa3 --queue=223 --html --mail=adress@mail.com"
The vzlaunch script launches a Python script on an OpenVZ container with vzctl enter ID /path/script.py, where ID is the container ID and /path/script.py is the script on the container.
The report_install and report_test files are located on a different machine, accessed through an NFS share. That should not matter, but as I really don't know what's going on when it fails, I mention it anyway.
When it fails, the process on the container dies. It does not remain in any state of zombieness or anything; it's just dead. And although the process on the container dies, the main process (the one that launches them all) continues as if everything were fine.
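One thing I still plan to check: Popen.wait() returns a negative value when the child was terminated by a signal, so logging the return code of process_test from the snippet above should at least tell me whether something is killing the process. A minimal sketch of that check:

rc = process_test.wait()
if rc < 0:
    # A negative return code means the child was terminated by a signal,
    # e.g. -9 for SIGKILL.
    logger.error('test command was killed by signal %d', -rc)
else:
    logger.debug('test command exited with status %d', rc)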
Some more info: I tried the buffer-flushing approach suggested by smci, but the writing of my log file keeps getting cut right in the middle of a line:
[18:55:27][Scripteo] Create process '/QA/valideo.trunk/tests/756/test.py -i 10.1.11.122 --report --verbose --name 756 --...
[18:56:35][Scripteo] Create process '/QA/valideo.trunk/tests/762/test.py -i 10.1.11.122 --report --verbose --name 762 --...
[18:57:56][Scripteo] Create process '/QA/valideo.trunk/tests/764/test.py -i 10.1.11.122 --report --verbose --name 764 --...
[18:59:27][Scripteo] Create process '/QA/valideo.trunk/tests/789/test.py -i 10.1.11.122 --report --verbose --name 789 --...
[19:00:44][Scripteo] Create process '/QA/valideo.trunk/tests/866/test.py -i 10.1.11.122 --report --verbose --name 866 --...
[19:02:27][Scripteo] Create process '/QA/valideo.trunk/tests/867/test.py -i 10.1.11.122 --report --verbose --name 867 --...
[19:04:13][Scripteo] Create process '/QA/valideo.trunk/tests/874/t
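To be explicit about what I tried: by "buffer flushing" I mean roughly the following in the container-side script.py (a sketch; the real logging function is more involved):

import os

def log_line(logfile, msg):
    logfile.write(msg + '\n')
    # Push the line out of Python's buffer and ask the OS to write it out;
    # the log lives on an NFS share, so this seemed worth doing.
    logfile.flush()
    os.fsync(logfile.fileno())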