
I'm converting a program to multiprocessing and need to be able to log to a single rotating log file from the main process as well as the subprocesses. I'm trying to use the 2nd example in the Python Logging Cookbook, "Logging to a single file from multiple processes", which starts a `logger_thread` running as part of the main process, picking log messages off a queue that the subprocesses add to. The example works well as is, and it also works if I switch to a `RotatingFileHandler`.

However, if I change it to start `logger_thread` before the subprocesses (so that I can log from the main process as well), then as soon as the log rotates, all subsequent logging generates a traceback with `WindowsError: [Error 32] The process cannot access the file because it is being used by another process`.

In other words, I change this code from the 2nd example:

workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()

to this:

logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()

and swap out `logging.FileHandler` for `logging.handlers.RotatingFileHandler` (with a very small `maxBytes` for testing), and then I hit this error.

I'm using Windows and Python 2.7. `QueueHandler` is not part of the stdlib until Python 3.2, but I've copied the source code from the Gist, which it says is safe to do.

I don't understand why starting the listener first would make any difference, nor why any process other than the main one would be attempting to access the file.
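For reference, the pattern in play looks roughly like this (a minimal sketch, not the cookbook code verbatim; `worker_process` and `logger_thread` are the cookbook's names, and this `QueueHandler` is a simplified stand-in for the backported one):

import logging
import threading
from multiprocessing import Process, Queue

class QueueHandler(logging.Handler):
    """Simplified stand-in for the backported QueueHandler:
    subprocesses push LogRecords onto a shared queue instead of
    touching the log file themselves."""
    def __init__(self, queue):
        logging.Handler.__init__(self)
        self.queue = queue

    def emit(self, record):
        # Merge any args into the message so the record pickles cleanly
        record.msg = record.getMessage()
        record.args = None
        record.exc_info = None
        self.queue.put(record)

def worker_process(q):
    # Each subprocess logs only through the queue; no file handler here
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    root.addHandler(QueueHandler(q))
    logging.getLogger('worker').info('doing some work')

def logger_thread(q):
    # Runs inside the main process: drains the queue and dispatches
    # each record to the handlers configured there (the file handler)
    while True:
        record = q.get()
        if record is None:  # sentinel: shut the listener down
            break
        logging.getLogger(record.name).handle(record)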

fantabolous

2 Answers

3

You should never start any threads before your subprocesses. When Python forks, threads and IPC state will not always be copied properly into the child.

There are several resources on this; just google for fork and threads. Some people claim they can do it, but it's not clear to me that it can ever work properly.

Just start all your processes first.

For example:

Status of mixing multiprocessing and threading in Python

https://stackoverflow.com/a/6079669/4279

In your case, it might be that the copied open file handle is the problem, but you should still start your subprocesses before your threads (and before you open any files that you will later want to destroy).

Some rules of thumb, summarized by fantabolous from the comments:

  • Subprocesses must always be started before any threads created by the same process.

  • multiprocessing.Pool creates both subprocesses AND threads, so one mustn't create additional Processes or Pools after the first one.

  • Files should not already be open at the time a Process or Pool is created. (This is OK in some cases, but not, e.g. if a file will be deleted later.)

  • Subprocesses can create their own threads and processes, with the same rules above applying.

  • Starting all processes first is the easiest way to do this; a sketch of that ordering follows this list.
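Applied to the code in the question, that ordering looks roughly like this (a sketch under the question's own setup; `worker_process`, `logger_thread`, the queue `q`, and the config dict `d` are as in the cookbook example):

# 1. Start all subprocesses first, while the main process still has
#    only its main thread and has not opened the log file yet.
workers = []
for i in range(5):
    wp = Process(target=worker_process, name='worker %d' % (i + 1), args=(q,))
    workers.append(wp)
    wp.start()

# 2. Only now configure logging (which opens the rotating log file)
#    and start the listener thread in the main process. Records the
#    workers queued in the meantime simply wait until it starts.
logging.config.dictConfig(d)
lp = threading.Thread(target=logger_thread, args=(q,))
lp.start()

# 3. The main process can log directly from here on; its handlers and
#    the listener thread live in the same process, so only this
#    process ever holds the file, and rotation can succeed.
logging.getLogger('main').info('main process logging works too')

# 4. Shut down: wait for the workers, then stop the listener with the
#    None sentinel.
for wp in workers:
    wp.join()
q.put(None)
lp.join()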

Patrick Maupin
  • Thank you - had no idea. I suppose that also means I shouldn't be creating the processes from within a non-main thread as I am now ;) Your last comment "should start your subprocesses before ... you open any files that you will later want to destroy" will be tricky. Does this apply even to files that the subprocesses will never use/touch? – fantabolous Aug 20 '15 at 01:46
  • It's only files that you have open when you start the subprocess. Opening, then closing immediately is no problem. But you can also close open file handles as part of the subprocess. Perhaps consider doing a lot less in your main process, and moving even the logging to a subprocess. – Patrick Maupin Aug 20 '15 at 01:48
  • OK, moving logging to a subprocess shouldn't be difficult. Currently I open a large datafile on start to determine the list of tasks to send to my process pool, and later, based on the results sent back by the processes, I update the same file in the main process. I suppose I should initialize the pool at the very start, and then later sending `pool.apply_async` should be safe? Would this mean I'd be restricted to using a pool, e.g. I can't start `multiprocessing.Process` at arbitrary times to fire off new tasks? – fantabolous Aug 20 '15 at 02:35
  • Sorry, I don't know multiprocessing internals, but yeah, if its [async processing threads](http://stackoverflow.com/a/24771322/3577601) are still running, I wouldn't create additional processes. Maybe the main process should create a proxy process before it builds its pool, and then use the proxy process to create any additional one-off processes later, or maybe you move the pool creation itself into a subprocess. – Patrick Maupin Aug 20 '15 at 02:52
  • It might be cleanest to have a "no threads" rule in the top-level process, which would mean that you don't use multiprocessing.Pool there. – Patrick Maupin Aug 20 '15 at 02:53
  • I see. So in summary: 1. Processes must always be started before any threads created by the same process. 2. `multiprocessing.Pool` creates both processes AND threads, so one mustn't create additional Processes or Pools after the first one. 3. Subprocesses can create their own threads and processes, with the same rules above applying. 4. Files must not already be open at the time a Process or Pool is created. – fantabolous Aug 20 '15 at 03:44
  • That's the good rule of thumb. In reality, you shouldn't have any threads except the main one running when you start a process; starting all processes first is the easy way to achieve this. – Patrick Maupin Aug 20 '15 at 03:53
  • Really appreciate the help - it was actually much easier to move my process creation to the start than I had expected, and it works like a charm now. (I attempted to add some of your points into your answer so other people with similar questions wouldn't need to read through our comments, but I guess that's not a valid edit!) – fantabolous Aug 20 '15 at 09:12
2

So, you can simply make your own file log handler. I have yet to see logs getting garbled from multiprocessing, so it seems file log rotation is the big issue. Just do this in your main, and you don't have to change any of the rest of your logging:

import logging
import logging.handlers
import os
from multiprocessing import RLock

class MultiprocessRotatingFileHandler(logging.handlers.RotatingFileHandler):
    def __init__(self, *args, **kwargs):
        super(MultiprocessRotatingFileHandler, self).__init__(*args, **kwargs)
        # Replace the handler's thread lock with a lock that can be
        # shared across processes.
        self.lock = RLock()

    def shouldRollover(self, record):
        # Hold the shared lock while deciding whether to roll over, so
        # two processes can't try to rotate the file at the same time.
        with self.lock:
            return super(MultiprocessRotatingFileHandler, self).shouldRollover(record)

file_log_path = os.path.join('var', 'log', os.path.basename(__file__) + '.log')
file_log = MultiprocessRotatingFileHandler(file_log_path,
                                           maxBytes=8*1000*1024,
                                           backupCount=5,
                                           delay=True)

logging.basicConfig(level=logging.DEBUG)
logging.getLogger().addHandler(file_log)

I'm willing to guess that acquiring the lock every time you try to rotate probably slows logging down, but this is a case where we need to sacrifice performance for correctness.
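A hypothetical usage sketch, with one assumption worth stating: the lock only synchronizes rotation if every process shares the same `RLock` instance, which holds when the workers are started after the handler is created and inherit it (e.g. under fork on Unix); under Windows' spawn, each child would rebuild its own handler and lock.

from multiprocessing import Process

def worker(n):
    # The root logger in each child carries the handler (and its
    # shared RLock) inherited from the parent at fork time.
    logging.info('worker %d says hello', n)

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()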

Stefan Sullivan