16

I'm running a number of processes using multiprocessing.Pool

Each process has to query my mysql database.

I currently connect to the database once and then share the connection between the processes

It works but occasionally I get strange errors. I've confirmed that the errors are caused when querying the database.

I figured the problem is because the same connection is used for all the processes.

  • Is this correct?

As I looked for an answer I stumbled upon this q&a How to share a single MySQL database connection between multiple processes in Python

So I looked up Class pooling.MySQLConnectionPool

If I understand this. I'll set up a pool with a number of connections and share the pool between processes. Each process will then look into that pool and if a connection is available use it or else wait until a connection is freed.

  • Is this correct?

But then I found this q&a Accessing a MySQL connection pool from Python multiprocessing

It seems first that "mata" confirms what I suspected but at the same time he dismisses the use of setting up a pool to be shared between processes

sharing a database connection (or connection pool) between different processes would be a bad idea (and i highly doubt it would even work correctly),

Instead he suggests

so each process using it's own connections is actually what you should aim for.

What does that mean?

  • Should I create a single connection for each worker? Then what are mysql pools good for?

The example given by mata in his answer seems reasonable enough but I don't understand the passing of the entire pool as the init argument

p = Pool(initializer=init)
  • Why? (As ph_singer points out in the comments this is not a good solution)

Changing the blocking Pool.map() method to Pool.map_async() and sending a connection from the pool to the map_async(q, ConnObj) should suffice?

  • Is this correct?

In the comments it's mentioned that

The only way of utilizing one single pool with many processes is having one dedicated process which does all the db access communicate with it using a queue

UPDATE Found this. Seems to agree: https://stackoverflow.com/a/26072257/1267259

If you need large numbers of concurrent workers, but they're not using the DB all the time, you should have a group of database worker processes that handle all database access and exchange data with your other worker processes. Each database worker process has a DB connection. The other processes only talk to the database via your database workers.

Python's multiprocessing queues, fifos, etc offer appropriate messaging features for that.

  • Is this really correct?

Isn't the purpose of a mysql pool to handle requests by the processes and relay them to a available connection?

Now I'm just confused...

Community
  • 1
  • 1
user1267259
  • 761
  • 2
  • 10
  • 22
  • 1
    Have you read [the documentation](http://dev.mysql.com/doc/connector-python/en/connector-python-connection-pooling.html) about this? Basically, the idea is to create a connection pool in the main process, and then in each spawned thread/process, you request connections from that pool. Threads should not share the same identical connection, because then threads can block each other from one of the major activities that threading is supposed to help with: IO. – ely Feb 20 '15 at 22:15
  • Another perfectly fine method is to place the code that requests a connection only in the thread/process-specific section of code, so that each thread gets it's own connection (and the implicit "pool" in this case is the actual database driver itself, which supports concurrent connections). Especially for embarrassingly parallel problems where you just need a quick connection to query data independent of other processes/threads, this can make sense. What you lose, though, is explicit control of the total number of connections and (not relevant for MySQL with ISAM) transaction management. – ely Feb 20 '15 at 22:21
  • @Mr.F _"Have you read the documentation about this? Basically, the idea is to create a connection pool in the main process, and then in each spawned thread/process, you request connections from that pool."_ Yes and that's how I thought I understood it until I read mata's answer. Also when you say _"Threads should not share the same identical connection, because then threads can block each other from one of the major activities that threading is supposed to help with: IO."_ You're refering to my first subquestion right? Not your previous sentence? – user1267259 Feb 20 '15 at 22:39
  • Yes, that is right. The answer from "mata" on the linked question describes a third way: create an entirely new pool in each of the threads/processes, and make requests of that pool only within that thread/process. I believe the concern "mata" has is that if a pool object is created in one process, and shared with others, that somehow the database IO must flow from the process through the main process with the pool to the database, the back to the main process, then back to the original process that requested the connection. But this disagrees with the second bullet point in the linked docs. – ely Feb 20 '15 at 22:53
  • ["A pool opens a number of connections and handles thread safety when providing connections to requesters."](http://dev.mysql.com/doc/connector-python/en/connector-python-connection-pooling.html) -- I would be quite surprised if this is different when speaking about **processes** instead of *threads*, but perhaps it is and perhaps that's the exact distinction "mata" was making ... I don't know the answer and would have assumed that by 'thread safety' it would also mean 'process safety' since that's easier. – ely Feb 20 '15 at 22:55
  • @Mr.F Re "the third way" then why not pass to init an "ordinary" connection (ie "connector.connect" and not a connector.pooling.MySQLConnectionPool)? Each process only has use for one connection at a time. /// I'm no expert but I understand the second bullet as the connector.pooling is using threads it is thread safe. Or am I'm getting your answer wrong? // Re "Another perfectly fine method....." you mean eg `map_async(q, pool.get_connection())` and `def q(conn): conn.cursor() .....`? – user1267259 Feb 20 '15 at 23:17
  • your comments suggest a basic misunderstanding of what `multiprocessing` does. In short, listen to mata, ignore ph_singer. There is no meaningful difference between `.map()` and `.map_async().get()`. Start with a single db connection per process. If you think you need significantly more processes than CPUs or you have 100s-1000s CPUs -- describe in detail what you are trying to do (you should not need it in most cases). – jfs Feb 22 '15 at 00:44
  • @J.F.Sebastian yes I'm having trouble with mp (you've seen my other questions ;) But I'm not sure where this came from *"There is no meaningful difference between .map() and .map_async().get()"* I'm not sure where i made a difference between them...? Anyhow, in my case we're talking about no more than 3-6 processes so yes a single connection per process will work for me. But then you say listen to mata, and mata says pool per processes, But why use pool per process instead of one connection per process? (continuing in next comment) – user1267259 Feb 22 '15 at 01:47
  • @J.F.Sebastian Ie what is the difference between using one connection per worker and using a pool of connections per worker? Is it because as Mr.F says? The only thing I can think of is if each worker needs to do successive queries to the db (otherwise it will block until the query to the db is finished) ie pool can speed up querying in each childprocess. – user1267259 Feb 22 '15 at 01:48
  • 1
    *"why use [db connections] pool"* -- if there are multiple *threads* in your worker process then the pool might be useful (several threads can read/write data in parallel (CPython can release GIL during I/O)). If there is only one thread per worker process then there is no point to use the db pool. – jfs Feb 22 '15 at 02:05
  • @J.F.Sebastian I see. Then I'd assume you agree with Tata and the following *"Basically, the idea is to create a connection pool in the main process, and then in each spawned process, you request connections from that pool."* Is wrong? Ie don't pass around the pool between processes? Which would then answer my last question as well. – user1267259 Feb 22 '15 at 02:24
  • 1
    @user1267259: no. Here's come the basic misunderstanding that I've mentioned earlier: *"don't pass around the pool between processes"* -- most pool implementations **can't** be passed around (and they **should not**). Start with a single db connection per process (or a pool of db connections that you create *inside* a worker process if there are multiple db-interacting threads in it) -- *if* it breaks then describe your *specific* situation exactly. – jfs Feb 22 '15 at 02:49
  • @J.F.Sebastian Forgive me for being dense but isn't there a difference between *passing a pool between processes* and *passing a connection from a pool between processes*? Eg `cnxpool = mysql.connector.pooling.MySQLConnectionPool(....) # main process (I can't create a new line!) cnx = cnxpool.get_connection() # pass cnx to worker which cnx.close() it when done` – user1267259 Feb 22 '15 at 11:39
  • 2
    @user1267259: yes. There is a difference. But It doesn't matter. You should not do neither. You pass nothing. Each child process creates its own db connections if it needs them (either individually or as a pool). – jfs Feb 22 '15 at 11:56

1 Answers1

11

Found Share connection to postgres db across processes in Python
The answer to my first question seems to be

You can't sanely share a DB connection across processes like that. You can sort-of share a connection between threads, but only if you make sure the connection is only used by one thread at a time. That won't work between processes because there's client-side state for the connection stored in the client's address space.

The answer to my remaining questions basically boils down to which of the following statements you go with (from the discussion in the comments in this q&a)

Basically, the idea is to create a connection pool in the main process, and then in each spawned thread/process, you request connections from that pool. Threads should not share the same identical connection, because then threads can block each other from one of the major activities that threading is supposed to help with: IO. – Mr. F

or

neither pass the pool or a connection from the pool to childprocesses

Each child process creates its own db connections if it needs them (either individually or as a pool) – J.F. Sebastian.

and

"why use [db connections] pool" -- if there are multiple threads in your worker process then the pool might be useful (several threads can read/write data in parallel (CPython can release GIL during I/O)). If there is only one thread per worker process then there is no point to use the db pool. – J.F. Sebastian


As a side note

This doesn't exactly answer my third question but it does actually present creating a connection per process as feasible in some cases (Share connection to postgres db across processes in Python)

It's unclear what you're looking for here. 5 connections certainly isn't an issue. Are you saying you may eventually need to spawn 100s or 1000s of processes, each with their own connection? If so, even if you could share them, they'd be bound to the connection pool, since only one process could use a given connection at any given time. – khampson Sep 27 '14 at 5:19

Community
  • 1
  • 1
user1267259
  • 761
  • 2
  • 10
  • 22
  • Awesome summary. About the last quote, I don't think it's correct. From the last quote: "even if you could share them, they'd be bound to the connection pool, since only one process could use a given connection at any given time" This can't be correct. Every process creates its own pool, and connection acquisitions in the pool of one process are not visible to processes. So two processes can write requests at the same time and corrupt each other. – Marius Nov 07 '17 at 09:19
  • The connections pool seems, at least from [psycopg](http://initd.org/psycopg/docs/pool.html#psycopg2-pool-connections-pooling) docs, to exist in order to share connections between threads. As it says, creating connections can be an expensive task, thus one would avoid this cost by using a pool. It looks like a good solution for sharing connections between processes, BUT the docs also says that connections [shouldn’t be used by a forked processes](https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-CONNECT). Then I would say the best solution would be the malta's one. – leoschet Jun 12 '18 at 13:34
  • What approach did you follow? Any con on malta's (without pools) solution? – leoschet Jun 12 '18 at 13:36