
I have a Python script that uses the following to restart itself:

python = sys.executable
os.execl(python, python, *sys.argv)

Most of the time this works fine, but occasionally the restart fails with a "No module named" error. Examples:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
  File "/usr/lib/python2.7/posixpath.py", line 17, in <module>
    import warnings
  File "/usr/lib/python2.7/warnings.py", line 6, in <module>
    import linecache
ImportError: No module named linecache

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
  File "/usr/lib/python2.7/posixpath.py", line 15, in <module>
    import stat
ImportError: No module named stat

Edit: I attempted gc.collect() as suggested by andr0x, and this did not work. I got the same error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site.py", line 68, in <module>
    import os
  File "/usr/lib/python2.7/os.py", line 49, in <module>
    import posixpath as path
ImportError: No module named posixpath

Edit 2: I tried sys.stdout.flush() and I'm still getting the same error. I've noticed I only ever get between 1 and 3 successful restarts before an error occurs.

Martyn
  • Can you give more information about what "occasionally" means? Have you tried your example in a loop? How often does it fail? Is there something in your script that could cause this behaviour? – Fabian Nov 19 '13 at 14:48
  • The script runs constantly, and fails after 2+ days or so. The script restarts roughly every 8-15 hours. – Martyn Nov 19 '13 at 14:50
  • What OS and version? What minor version of 2.7.x? – TkTech Nov 19 '13 at 16:34
  • Python is 2.7.2+ and OS is Linux Mint 12 Lisa 3.0.0-14-generic – Martyn Nov 20 '13 at 08:14

3 Answers


I believe you are hitting the following bug:

http://bugs.python.org/issue16981

Since it is unlikely that these modules are actually disappearing, another error must be at fault. The bug report lists 'too many open files' as prone to causing this issue, though I am unsure whether other errors can also trigger it.
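
If you want to check whether your process is anywhere near that limit, you can log the open descriptor count just before the restart. Here is a rough diagnostic sketch (my own illustration, not from the bug report; Linux-only, since it counts the entries in /proc/self/fd):

import os
import resource

def log_fd_usage():
    # Each entry in /proc/self/fd is a descriptor this process has open.
    open_fds = len(os.listdir("/proc/self/fd"))
    # The soft RLIMIT_NOFILE is the limit that 'too many open files'
    # (EMFILE) is measured against.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print "open fds: %d, soft limit: %d" % (open_fds, soft)

If the count creeps toward the soft limit between restarts, a descriptor leak is almost certainly the underlying error.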

I would make sure you are closing any file handles before hitting the restart code. You can also force the garbage collector to run manually with:

import gc
gc.collect()

http://docs.python.org/2/library/gc.html

You can try running that just before hitting the restart code as well.
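
Putting the two together, the restart code would look roughly like this (a sketch only; open_files is a hypothetical stand-in for whatever file objects your script actually keeps around):

import gc
import os
import sys

# Close everything the script opened explicitly...
for f in open_files:  # open_files is hypothetical; substitute your own handles
    f.close()

# ...then force a collection so any unreferenced file objects are
# finalized (and their descriptors closed) before the exec.
gc.collect()

python = sys.executable
os.execl(python, python, *sys.argv)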

imandrewd
  • I can't see any open file handles during a restart. I have added the gc code to my restart function and will see if that helps. – Martyn Nov 19 '13 at 15:00
  • Nope, this did not work. I got another error on Friday. I edited the original post. – Martyn Nov 25 '13 at 09:19
  • As the [Python doc](http://docs.python.org/2/library/os.html) says, "The current process is replaced immediately. Open file objects and descriptors are not flushed." So how about flushing them using sys.stdout.flush() or os.fsync() before calling an exec* function? – user2390183 Nov 25 '13 at 19:21
  • Tried sys.stdout.flush. This did not help me. – Martyn Nov 28 '13 at 16:03

If the problem is that too many files get opened, then you have to set the FD_CLOEXEC flag on the file descriptors so that they are closed when exec happens. Here is a piece of code that simulates hitting the file descriptor limit while reloading, and which contains a fix that avoids hitting the limit. If you want to simulate a crash, set fixit to False. When fixit is True, the code goes through the list of file descriptors and marks them FD_CLOEXEC. This works on Linux. People working on systems that don't have /proc/<pid>/fd/ will have to find a system-appropriate way to list the open file descriptors. This question may help.

import os
import sys
import fcntl

pid = str(os.getpid())

def fds():
    return os.listdir(os.path.join("/proc", pid, "fd"))

files = []

print "Number of files open at start:", len(fds())

for i in xrange(0, 102):
    files.append(open("/dev/null", 'r'))

print "Number of files open after going crazy with open()", len(fds())

fixit = True
if fixit:
    # Cycle through all file descriptors opened by our process.
    for f in fds():
        fd = int(f)
        # Transmit the stds to future generations, mark the rest as close-on-exec.
        if fd > 2:
            try:
                fcntl.fcntl(fd, fcntl.F_SETFD, fcntl.FD_CLOEXEC)
            except IOError:
                # Some files can be closed between the time we list
                # the file descriptors and now. Most notably,
                # os.listdir opens the dir and it will probably be
                # closed by the time we hit that fd.
                pass

print "reloading"
python = sys.executable
os.execl(python, python, *sys.argv)

With this code, what I get on stdout are these 3 lines repeated until I kill the process:

Number of files open at start: 4
Number of files open after going crazy with open() 106
reloading

How the code works

The code above gets the list of open file descriptors through the fds() function. On a Linux system the file descriptors opened by a specific process are listed at:

/proc/<process id of the process we want>/fd

So if the process id of your process is 100 and you do:

$ find /proc/100/fd

You'll get a list like:

/proc/100/fd/0
/proc/100/fd/1
/proc/100/fd/2
[...]

The fds() function just gets the basenames of all these files: ["0", "1", "2", ...]. (A more general solution might convert them to integers right away. I chose not to do that.)

The second key part is setting FD_CLOEXEC on all the file descriptors except std{in,out,err}. Setting FD_CLOEXEC on a file descriptor tells the operating system that the next time exec is executed, the OS should close that descriptor before giving control to the next executable. The flag is documented in the man page for fcntl(2).
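
One aside about that fcntl call: F_SETFD replaces the descriptor's whole flag set. Since FD_CLOEXEC is in practice the only such flag, the code above is fine as written, but the defensive idiom is to read the current flags first and OR the new one in:

flags = fcntl.fcntl(fd, fcntl.F_GETFD)
fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)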

In an application that uses threads that open files, the code above can miss setting FD_CLOEXEC on some file descriptors: a thread may open new files between the time the list of file descriptors is obtained and the time exec is called. I believe the only way to ensure this does not happen is to replace os.open with code that calls the stock os.open and then sets FD_CLOEXEC right away on the returned file descriptor; a sketch of that idea follows.
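
A minimal sketch of that wrapper (untested; note that it only narrows the race to the gap between open and fcntl inside the wrapper, and that files opened through the builtin open() do not go through os.open, so they would need similar treatment):

import fcntl
import os

_real_open = os.open

def _cloexec_open(path, flags, *args):
    # Call the stock os.open, then immediately mark the new
    # descriptor close-on-exec before returning it.
    fd = _real_open(path, flags, *args)
    current = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, current | fcntl.FD_CLOEXEC)
    return fd

os.open = _cloexec_open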

Louis
  • I think this is it, thanks. I ran your code with fixit False and got an ImportError: No module named os. Can you explain how this actually works, or point me to some reading to understand it? Thank you again – Martyn Dec 02 '13 at 09:22

Not a real answer, just a workaround for your actual problem: have you considered starting a child process and, if it terminates at once, starting another? This has some implications, like an ever-changing PID, but maybe you can live with that.

Instead of

python = sys.executable
os.execl(python, python, *sys.argv)

you could use

import time, os

MONITOR_DURATION = 3.0
# ^^^ time in seconds we monitor our child for terminating too soon

python = sys.executable
while True:  # until we have a child which survived the monitor duration
  pid = os.fork()  # splice this process into two
  if pid == 0:  # are we the child process?
    os.execl(python, python, *sys.argv)  # start this program anew
  else:  # we are the parent process
    startTime = time.time()
    while startTime + MONITOR_DURATION > time.time():
      exitedPid, status = os.waitpid(pid, os.WNOHANG)
      # ^^^ check whether our child has terminated yet
      #     (without really waiting for it, due to WNOHANG)
      if exitedPid == pid:  # did our child terminate too soon?
        break
      else:  # no, nothing terminated yet
        time.sleep(0.2)  # wait a little before testing child again
    else:  # we survived the monitor duration without reaching a "break"
      break  # so we have a good running child, leave the outer loop
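
One detail not shown in the snippet: after the outer break, this (parent) process is supposed to terminate and leave the freshly exec'ed child running (see my comment below), so in practice the loop would be followed by something like:

import sys
sys.exit(0)  # parent exits; the child carries on as the new instance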
Alfe
  • I've never looked into this, could you point me towards some documentation/example code to do this? Thanks – Martyn Nov 28 '13 at 16:09
  • I sketched a piece of code which would do this. The main difference from your version is that mine _terminates_ after the restart; another process (the child) then takes over. This could lead to problems if someone else is waiting for your process and is supposed to never stop waiting. – Alfe Nov 29 '13 at 08:53
  • Thanks for that, the code looks nice. I've integrated it into my restart function and will see how it goes. Cheers. – Martyn Nov 29 '13 at 09:06
  • Keep in mind that if your problem derives from an inherited resource problem (like too many open files), this method won't help at all. The child inherits all open file handles and then runs into the same problem as the `exec`ed (replaced) process. – Alfe Nov 29 '13 at 09:33
  • This fixed my failure to restart, but it caused my VM to crash. I assume this relates to having too many files open. – Martyn Dec 02 '13 at 09:23
  • Your VM crashed? You mean your underlying simulated hardware? That's odd. – Alfe Dec 02 '13 at 10:21
  • Well, it was using 100% CPU and 100% memory. The script was trying to restart every minute, and I assume it was failing every time. The VM client crashed; the host itself and the other nodes were fine. – Martyn Dec 02 '13 at 10:24