
I've encountered some strange application behaviour while interacting with a database from many processes. I'm using Linux.

I have my own implementation of QueryExecutor, which uses a single connection during its lifetime:

class QueryExecutor(object):
    def __init__(self, db_conf):
        self._db_config = db_conf
        self._conn = self._get_connection()

    def execute_query(self, query):
        # some code
        pass
    # some more code

_QUERY_EXECUTOR = None

def query_executor():
    global _QUERY_EXECUTOR
    if _QUERY_EXECUTOR is None:
        _QUERY_EXECUTOR = QueryExecutor(some_db_config)
    return _QUERY_EXECUTOR

The QueryExecutor instance is never modified after instantiation.

Initially there is only one process, which from time to time forks (os.fork()) several times. The new processes are workers which do some tasks and then exit. Each worker calls query_executor() to be able to execute a SQL query.

I have found out that SQL queries often return wrong results (it seems that sometimes a query result is returned to the wrong process). The only sensible explanation is that all processes share the same SQL connection (according to the MySQLdb docs: "threadsafety = 1: Threads may share the module, but not connections").

I wonder which OS mechanism leads to this situation. As far as I know, on Linux when a process forks, the parent process's pages are not copied for the child; they are shared by both processes until one of them tries to modify a page (copy-on-write). As I mentioned before, the QueryExecutor object remains unmodified after creation. I guess this is why all processes use the same QueryExecutor instance and hence the same SQL connection.
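For example, the following quick test (my own sketch; in the real application the fork() is done by the external library, here I call os.fork() directly) suggests the child really does see the same instance:

import os

# Diagnostic sketch: if the child inherits the module-level _QUERY_EXECUTOR,
# query_executor() returns the very same (copy-on-write) instance there,
# and its connection's file descriptor refers to the same kernel socket.
executor = query_executor()
print("parent pid=%d executor id=%#x" % (os.getpid(), id(executor)))

pid = os.fork()
if pid == 0:  # child: no new connection is opened here
    print("child pid=%d executor id=%#x" % (os.getpid(), id(query_executor())))
    os._exit(0)
os.waitpid(pid, 0)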

Am I right, or am I missing something? Do you have any suggestions?

Thanks in advance!

Grzegorz


1 Answer


The root of the problem is that fork() creates an exact, independent copy of the process, but the two processes share open files, sockets and pipes: the child inherits the parent's file descriptors, and they refer to the same underlying connection. That's why any data written by the MySQL server can be read [correctly] from only a single process, and if two processes make requests and read responses over the same connection, they will quite likely mess up each other's work. This has nothing to do with "multithreading", because in the multi-threaded case there is a single process with a few threads of execution which share data and can coordinate.
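The effect is easy to reproduce without MySQL at all. Here is a sketch with a plain socketpair() standing in for the server connection (the names are made up for illustration):

import os
import socket

# One "server" process writes two replies into a stream socket;
# two related processes then read from the *same* inherited socket,
# so either of them may consume either reply (or even both at once).
left, right = socket.socketpair()

if os.fork() == 0:            # "server": answers two requests
    left.close()
    right.sendall(b"reply-for-A\n")
    right.sendall(b"reply-for-B\n")
    os._exit(0)
right.close()

if os.fork() == 0:            # worker B shares the inherited socket
    print("B got: %r" % left.recv(64))
    os._exit(0)

print("A got: %r" % left.recv(64))   # may well print B's reply
os.wait()
os.wait()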

The correct way to use fork() is to close (or re-open) all file-handle-like objects right after forking in all but one copy of the process, or at least to avoid using them from multiple processes.
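Applied to the QueryExecutor from the question, one common pattern is to remember which pid opened the connection and transparently reconnect in a forked child. A sketch (MySQLdb and the body of _get_connection() are my assumptions about the elided code):

import os
import MySQLdb

class QueryExecutor(object):
    def __init__(self, db_conf):
        self._db_config = db_conf
        self._conn = self._get_connection()
        self._pid = os.getpid()        # remember who opened the connection

    def _get_connection(self):
        # assumed to open a fresh connection from the stored config
        return MySQLdb.connect(**self._db_config)

    def execute_query(self, query):
        if os.getpid() != self._pid:
            # We are in a fork()ed child holding an inherited connection.
            # Do NOT close() it -- that would send a quit packet over the
            # socket the parent still uses; just open a private one.
            self._conn = self._get_connection()
            self._pid = os.getpid()
        cur = self._conn.cursor()
        try:
            cur.execute(query)
            return cur.fetchall()
        finally:
            cur.close()

Since every worker goes through query_executor() anyway, the check costs one getpid() per query and needs no cooperation from the library that performs the fork().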

  • Is it a good idea if I declare a `multiprocessing.Lock` field in `QueryExecutor` and use it to synchronize usage of the `execute_query` method? Will all forked processes see *the same* lock? In the `multiprocessing` docs the lock is passed to the new process as an argument, `Process(target=f, args=(lock, num)).start()`, but I cannot do that since `os.fork()` is done in an external library. – Grzes Apr 30 '14 at 12:27
  • 1
    Sorry, tapped prematurely. Likely multiprocessing.Lock would help in case when processes write to a stream simultaneously, and you wish to clearly separate one "packet" of data (e.g. a line of output) from another. But if you deal with a two-way communication like SQL requests/responses session then it's not a solution, because responses won't be connected one-to-one to requests. Thus I would suggest to forcibly close and then re-open QueryExecutor connection in a child process. – user3159253 Apr 30 '14 at 13:06
  • OK, I understand, but what is wrong if the lock is held when executing: `cur = self._conn.cursor(); cur.execute(query); cur.close(); return result` ? – Grzes Apr 30 '14 at 13:16
  • Ah, this way it probably will work (see the sketch below). But be prepared that your processes will wait for each other most of the time. – user3159253 Apr 30 '14 at 13:25
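For completeness, a sketch of the locking variant discussed in the comments above. It only works if the lock is created before any fork(), so that every worker inherits the same underlying semaphore; LockingQueryExecutor is a hypothetical name:

import multiprocessing

_QUERY_LOCK = multiprocessing.Lock()   # must exist before the first fork()

class LockingQueryExecutor(QueryExecutor):
    def execute_query(self, query):
        # The inherited semaphore serializes whole request/response
        # cycles, so a reply cannot be consumed by a sibling process.
        with _QUERY_LOCK:
            cur = self._conn.cursor()
            try:
                cur.execute(query)
                return cur.fetchall()
            finally:
                cur.close()

As noted in the last comment, this serializes every query across all workers on a single connection, so reconnecting per child usually scales much better.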