1

The following class Downloader is supposed to query a SQL database multiple times and store the results in a list of pandas.DataFrame objects.

I would like to use multiprocessing to speed up the retrieval, however I get the error

    line 53, in run_queries
    dfs_queries = p.map(run_query, queries)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed

I have looked into this question which suggests that the pyodbc connection and cursor objects can not be pickled. Is there a way to still use the pool.map(f, arglist) from multiprocessing when f is reliant on a sql connection?

import pyodbc
from multiprocessing import Pool as ThreadPool
import pandas as pd

class Downloader(object):

    def _connect(self, path_db_config):
        # ... Loads a config file from which it gets dsn, user and password ... #

        con_string = 'DSN=%s;UID=%s;PWD=%s;' % (dsn, user, password)
        return pyodbc.connect(con_string)

    def run_queries(self):
        queries = [# List of sql queries #]
        p = ThreadPool(len(queries))

        def run_query(query):
            cnxn = self._connect(PATH_DB_CONFIG)
            df = pd.read_sql(query, cnxn)
            return df

        return p.map(run_query, queries)

Thanks for the help!!

tenticon
  • 2,639
  • 4
  • 32
  • 76
  • Whether or not there is a way around the cursor issue I'm not sure but you are also trying to use multiprocessing - which relies on pickling - on a nested function; [you cannot do that](https://stackoverflow.com/questions/12019961/python-pickling-nested-functions) – roganjosh Aug 29 '17 at 14:54
  • but the it should only pickle `run_query` no? and there is no nested method in there – tenticon Aug 29 '17 at 15:11
  • `run_query` _is_ nested though. – roganjosh Aug 29 '17 at 15:12
  • ok will keep that in mind. I think I need to extend `Process` and put both the sql connection and query functionality into the `run()` method. Then I loop through the queries, instantiating a process for each query and putting the result into a `multiprocessing.Queue` where I can retrieve it from later – tenticon Aug 29 '17 at 15:18
  • You may want to look into something like [this](https://stackoverflow.com/questions/9470403/pickling-class-method). Your proposed approach doesn't seem IIUC to address the issue. Basically, the function you want to pass to multiprocessing needs to be accessible in the module namespace. – roganjosh Aug 29 '17 at 15:26

0 Answers0