I finally understood how to replace pickle with dill from the following discussion: pickle-dill. For example, the following code worked for me:

import dill
import multiprocessing

def run_dill_encoded(what):
    # Runs in the worker: rebuild the (function, args) pair and call it.
    fun, args = dill.loads(what)
    return fun(*args)

def apply_async(pool, fun, args):
    # Serialize with dill up front so pickle only has to handle bytes.
    return pool.apply_async(run_dill_encoded, (dill.dumps((fun, args)),))

if __name__ == '__main__':
    pool = multiprocessing.Pool(5)
    results = [apply_async(pool, lambda x: x*x, args=(x,)) for x in range(1,7)]
    output = [p.get() for p in results]
    print(output)

I tried to apply the same approach to pymongo. The following code

import os
import dill
import multiprocessing
import pymongo

def run_dill_encoded(what):
    fun, args = dill.loads(what)
    return fun(*args)


def apply_async(pool, fun, args):
    return pool.apply_async(run_dill_encoded, (dill.dumps((fun, args)),))


def write_to_db(value_to_insert):           
    client = pymongo.MongoClient('localhost',  27017)
    db = client['somedb']
    collection = db['somecollection']
    result = collection.insert_one({"filed1": value_to_insert})
    client.close()

if __name__ == '__main__':
    pool = multiprocessing.Pool(5)
    results = [apply_async(pool, write_to_db, args=(x,)) for x in ['one', 'two', 'three']]
    output = [p.get() for p in results]
    print(output)

produces this error:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Python34\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "C:\...\temp2.py", line 10, in run_dill_encoded
    return fun(*args)
  File "C:\...\temp2.py", line 21, in write_to_db
    client = pymongo.MongoClient('localhost',  27017)
NameError: name 'pymongo' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/.../temp2.py", line 32, in <module>
    output = [p.get() for p in results]
  File "C:/.../temp2.py", line 32, in <listcomp>
    output = [p.get() for p in results]
  File "C:\Python34\lib\multiprocessing\pool.py", line 599, in get
    raise self._value
NameError: name 'pymongo' is not defined

Process finished with exit code 1

What is wrong?

user1700890
  • Hi, I'm the `dill` author. Looks like you don't define `pymongo` inside your function. Try putting the `import pymongo` inside `write_to_db`. The function will serialize much better (or at all, sometimes) if you make sure all variables used in the function are defined locally. – Mike McKerns Apr 21 '16 at 02:42
  • Also, there's an easier way to use `dill` in `multiprocessing`. Try the `multiprocess` module -- it's `multiprocessing` but with `pickle` replaced by `dill`. – Mike McKerns Apr 21 '16 at 02:44
  • @MikeMcKerns, thank you very much! It worked. I am still working on compiling `multiprocess` for Python 3.x. By the way, is there an analog of `apply_async` for threads? – user1700890 Apr 21 '16 at 13:37
  • Have a look at `from multiprocessing import dummy` and then `p = dummy.Pool()` and `p.apply_async`. It's threads, but using the process API. – Mike McKerns Apr 21 '16 at 13:54
  • I am afraid that if I use `p.apply_async` it will call the wrong `apply_async`. I need to make sure that the redefined `apply_async` is called. I am testing it right now. – user1700890 Apr 21 '16 at 14:17
  • All you need to do is pass the correct pool to your function; in this case, it's `p = dummy.Pool()`. – Mike McKerns Apr 21 '16 at 14:22
  • Thank you again! It worked! – user1700890 Apr 21 '16 at 14:37
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109829/discussion-between-user1700890-and-mike-mckerns). – user1700890 Apr 21 '16 at 14:55
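
The thread-pool suggestion from the comments might look roughly like the sketch below. It uses only the standard library. The helper `squares_with_threads` is a hypothetical name introduced for illustration, and `apply_async` here simply forwards to the pool, on the assumption that threads share memory and so nothing needs to be serialized.

```python
# Thread-backed pool that exposes the same API as multiprocessing.Pool.
from multiprocessing.dummy import Pool as ThreadPool


def apply_async(pool, fun, args):
    # Module-level helper taking the pool as its first argument, like the
    # wrapper in the question. No dill step here (an assumption of this
    # sketch): threads run in the same process, so the callable is used as-is.
    return pool.apply_async(fun, args)


def squares_with_threads(values, workers=5):
    # Hypothetical demo function: square each value on a thread pool.
    with ThreadPool(workers) as pool:
        pending = [apply_async(pool, lambda x: x * x, (v,)) for v in values]
        # Collect results before the pool is torn down on leaving the block.
        return [p.get() for p in pending]


if __name__ == "__main__":
    print(squares_with_threads(range(1, 7)))
```

Because the helper receives the pool explicitly, the same call site works whether the pool is process-backed or thread-backed.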

1 Answer

As I mentioned in the comments, you need to put `import pymongo` inside the function `write_to_db`. When the function is serialized and shipped to the other process space, it does not take any of its global references along, so the module-level name `pymongo` is undefined when the function runs in the worker.
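
To illustrate the reasoning, here is a minimal sketch that uses only the standard library and does not involve `dill` or `pymongo`. It imitates "shipping" a function by copying just its code object into a fresh, nearly empty globals dict. The helper `rebuild_without_globals` and both sample functions are hypothetical names chosen for illustration, and `math` stands in for `pymongo`. This is a simplification of what a real serializer does, not a description of `dill` internals.

```python
import builtins
import marshal
import types

import math  # module-level import: lives in this module's globals


def uses_global_import(x):
    # Looks up the name "math" in the function's globals at call time.
    return math.sqrt(x)


def uses_local_import(x):
    # The import statement is part of the function body, so it travels
    # with the code object.
    import math
    return math.sqrt(x)


def rebuild_without_globals(fun):
    # Hypothetical stand-in for sending a function to another process:
    # round-trip only the code object and attach a globals dict that
    # contains nothing but the builtins.
    code = marshal.loads(marshal.dumps(fun.__code__))
    return types.FunctionType(code, {"__builtins__": builtins}, fun.__name__)


if __name__ == "__main__":
    print(rebuild_without_globals(uses_local_import)(9.0))
    try:
        rebuild_without_globals(uses_global_import)(9.0)
    except NameError as exc:
        print("NameError:", exc)
```

The version with the import inside the body keeps working after the round trip, while the version that relies on the module-level import fails with the same kind of `NameError` shown in the question.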

Mike McKerns