I am having difficulty passing a database connection object or cursor object through `pool.map` in the Python `multiprocessing` package. Basically, I want to create a pool of workers, each with its own state and its own db connection, so that they can execute queries in parallel.
I have tried these approaches, but I get a `PicklingError` with them -
Use Initializer to set up multiprocess pool
The second link is exactly what I need to do: I'd like each process to open a database connection when it starts, then use that connection to process the data/args that are passed in.
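For reference, here is a minimal sketch of that initializer pattern, with the connection stored in a module-level global that the initializer sets once per worker. I'm using `sqlite3` as a stand-in driver and made-up table/column names (`items`, `id`, `value`), since I don't know the real schema or driver:

```python
import multiprocessing as mp
import sqlite3

# Each worker process gets its own connection, created once by the
# initializer. Connections wrap OS-level resources (sockets, file
# handles), so they cannot be pickled and shipped through pool.map.
_conn = None

def init_worker(db_path):
    global _conn
    _conn = sqlite3.connect(db_path)  # sqlite3 as a stand-in driver

def process_data(row_id):
    # Only the plain argument (an int) is pickled; the connection
    # already lives inside this worker process.
    cursor = _conn.cursor()
    cursor.execute("SELECT value FROM items WHERE id = ?", (row_id,))
    row = cursor.fetchone()
    return row[0] if row else None

def run(db_path, ids, workers=2):
    pool = mp.Pool(processes=workers,
                   initializer=init_worker,
                   initargs=(db_path,))
    try:
        return pool.map(process_data, ids)
    finally:
        pool.close()
        pool.join()
```

The key point is that `init_worker` runs inside each child process after it starts, so every worker opens a fresh connection rather than inheriting one from the parent.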
Here is my code (note that `repeat` comes from `itertools`):
`
from itertools import repeat
import multiprocessing as mp

def process_data((id, db)):
    print 'in processdata'
    cursor = db.cursor()
    query = ....
    #cursor.execute(query)
    #....
    .....
    .....
    return row

if __name__ == '__main__':
    db = getConnection()
    cursor = db.cursor()
    print 'Initialised db connection and cursor'
    inputs = [1, 2, 3, 4, 5]
    pool = mp.Pool(processes=2)
    result_list = pool.map(process_data, zip(inputs, repeat(db)))
    #print result_list
    pool.close()
    pool.join()
`
This results in the following error:
`Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.6/multiprocessing/pool.py", line 225, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed`
I guess the db or cursor object is not picklable according to Python, because if I replace `repeat(db)` with `repeat(x)`, where `x` is an int or a string, it works. I have tried using the initializer function and it seems to work at first, but weird things happen when I execute queries: many return nothing for an id even though data is present.
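You can confirm that guess by trying to pickle a connection directly. This sketch uses `sqlite3` as a stand-in, since I don't know which driver is in play; the exact exception type varies by driver and Python version (a `PicklingError` on some, a `TypeError` on others):

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real db connection

try:
    pickle.dumps(conn)
    picklable = True
except (pickle.PicklingError, TypeError) as exc:
    # Connection objects wrap sockets/file handles, which pickle refuses.
    picklable = False
    print("cannot pickle connection: %s" % exc)

# Plain values such as ints and strings pickle fine, which is why
# repeat(x) with an int works in pool.map while repeat(db) does not.
print(pickle.loads(pickle.dumps((1, "abc"))))
```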
What would be the best way to achieve this? I am using Python 2.6.6 on a Linux machine.