I am having difficulty passing a database connection object or cursor object through `pool.map` in the Python `multiprocessing` package. Basically, I want to create a pool of workers, each with its own state and its own db connection, so that they can execute queries in parallel.
I have tried these approaches, but I get a `PicklingError` with them -
Use Initializer to set up multiprocess pool
The second link is exactly what I need to do: I'd like each process to open a database connection when it starts, then use that connection to process the data/args that are passed in.
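For reference, here is a minimal sketch of that initializer pattern, with the connection stored in a module-level global that the initializer sets once per worker. I'm using `sqlite3` as a stand-in driver and made-up table/column names (`items`, `id`, `value`), since I don't know the real schema or driver:

```python
import multiprocessing as mp
import sqlite3

# Each worker process gets its own connection, created once by the
# initializer. Connections wrap OS-level resources (sockets, file
# handles), so they cannot be pickled and shipped through pool.map.
_conn = None

def init_worker(db_path):
    global _conn
    _conn = sqlite3.connect(db_path)  # sqlite3 as a stand-in driver

def process_data(row_id):
    # Only the plain argument (an int) is pickled; the connection
    # already lives inside this worker process.
    cursor = _conn.cursor()
    cursor.execute("SELECT value FROM items WHERE id = ?", (row_id,))
    row = cursor.fetchone()
    return row[0] if row else None

def run(db_path, ids, workers=2):
    pool = mp.Pool(processes=workers,
                   initializer=init_worker,
                   initargs=(db_path,))
    try:
        return pool.map(process_data, ids)
    finally:
        pool.close()
        pool.join()
```

The key point is that `init_worker` runs inside each child process after it starts, so every worker opens a fresh connection rather than inheriting one from the parent.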
Here is my code (note that `repeat` comes from `itertools`):
`
from itertools import repeat
import multiprocessing as mp

def process_data((id, db)):
    print 'in processdata'
    cursor = db.cursor()
    query = ....
    #cursor.execute(query)
    #....
    .....
    .....
    return row

if __name__ == '__main__':
    db = getConnection()
    cursor = db.cursor()
    print 'Initialised db connection and cursor'
    inputs = [1, 2, 3, 4, 5]
    pool = mp.Pool(processes=2)
    result_list = pool.map(process_data, zip(inputs, repeat(db)))
    #print result_list
    pool.close()
    pool.join()
`
This results in the following error:
`Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
self.run()
File "/usr/lib/python2.6/threading.py", line 484, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.6/multiprocessing/pool.py", line 225, in _handle_tasks
put(task)
PicklingError: Can't pickle <type 'module'>: attribute lookup __builtin__.module failed`
I guess the db or cursor object is not picklable according to Python, because if I replace `repeat(db)` with `repeat(x)`, where `x` is an int or a string, it works. I have tried using the initializer function and it seems to work at first, but weird things happen when I execute queries: many return nothing for an id even though data is present.
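You can confirm that guess by trying to pickle a connection directly. This sketch uses `sqlite3` as a stand-in, since I don't know which driver is in play; the exact exception type varies by driver and Python version (a `PicklingError` on some, a `TypeError` on others):

```python
import pickle
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real db connection

try:
    pickle.dumps(conn)
    picklable = True
except (pickle.PicklingError, TypeError) as exc:
    # Connection objects wrap sockets/file handles, which pickle refuses.
    picklable = False
    print("cannot pickle connection: %s" % exc)

# Plain values such as ints and strings pickle fine, which is why
# repeat(x) with an int works in pool.map while repeat(db) does not.
print(pickle.loads(pickle.dumps((1, "abc"))))
```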
What would be the best way to achieve this? I am using Python 2.6.6 on a Linux machine.