In short, you've entered a fight of costs-of-[TIME] against costs-of-[SPACE] ( avoiding replicated copies ). The pickle-related phase ( a blocker here for the standard Python way the process-instantiation(s) take place; it does not solve the problem per se ) is not important in this global view. Avoiding pickle / replication simply means you have to pay for another form of concurrency control over the "shared" ( non-replicable ) resource - the large_obj class-wrapper.
Q : Is there any solution for that?
Yes.
One of the possible solutions may be to design and start operating a smart-enough distributed-computing system architecture, where the large_obj gets calculated once and where its wrapper-class can concurrently ( not as a true-[PARALLEL] operation ) respond to "remote" processes' requests ( those processes being colocated on the same host or distributed round the globe ).
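A minimal sketch of that wrapper-class idea, kept transport-agnostic for now ( plain stdlib threads and queues stand in for the later ZeroMQ machinery; all names here are illustrative, not the asker's code ):

```python
import queue
import threading

class LargeObjService:
    """Owns the ONLY instance of large_obj - requests are served one at a
    time ( concurrently accepted, not true-[PARALLEL] executed ), and only
    small, right-sized replies ever leave the service."""

    def __init__(self, large_obj):
        self._large_obj = large_obj            # never pickled, never replicated
        self._requests = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            idx, reply_q = self._requests.get()
            reply_q.put(self._large_obj[idx])  # small answer, not the whole object

    def ask(self, idx):
        reply_q = queue.Queue(maxsize=1)
        self._requests.put((idx, reply_q))
        return reply_q.get()

# several concurrent "agents" may now query the single shared instance:
service = LargeObjService(list(range(1_000_000)))  # stand-in for an expensive large_obj
answer = service.ask(42)
print(answer)                                      # -> 42
```

The design choice to show: requests and replies are tiny, while the expensive resource stays put in exactly one place.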
Going towards that goal, I would start using a properly tuned ZeroMQ "mediation" infrastructure, where all of the { inproc:// | ipc:// | tipc:// | tcp:// | norm:// | pgm:// | epgm:// | vmci:// } transport-classes may and can coexist inside the same infrastructure. This allows computing agents to live at the same time both inside a tight-colocation zone ( on the same host, where needed for maximum proximity and minimum latency ) and/or across a wide network of interconnected distributed agents ( on remote hosts, where performance-scaling requires more processing power than an individual localhost platform can provide ).
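A hedged sketch of how several transport-classes can coexist on one and the same ZeroMQ socket ( assuming pyzmq is installed; the endpoint names are illustrative only ):

```python
import threading
import zmq

ctx = zmq.Context.instance()

def large_obj_agent():
    rep = ctx.socket(zmq.REP)
    # ONE socket, TWO coexisting transport-classes:
    rep.bind("inproc://large_obj")       # colocated agents, maximum proximity
    rep.bind("tcp://127.0.0.1:*")        # remote agents, across the network
    idx = int(rep.recv())                # small request in...
    rep.send_string(str(idx * idx))      # ...small computed reply out
    rep.close()

threading.Thread(target=large_obj_agent, daemon=True).start()

req = ctx.socket(zmq.REQ)
req.connect("inproc://large_obj")        # a colocated agent takes the fastest transport
req.send(b"7")
answer = req.recv()
print(answer)                            # -> b'49'
req.close()
```

Remote agents would `connect()` to the tcp:// endpoint of the very same REP socket, with no change to the service code.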
The resulting schema will harness as many [SPACE]-wise "lightweight" ZeroMQ-equipped processes ( processing Agents ) as needed, running inside whatever form of a just-concurrent Pool ( actual hardware and/or grid-computing resources still limit the scope of possible true-[PARALLEL] code-execution, so a "just"-[CONCURRENT] code-execution is the more appropriate term in a rigorous Computer Science sense ). Each such process-instance may live on its own and ad-hoc submit a request towards a "remote-shared-service-Agent" - the one who indeed owns the only instance of the large_obj, including all the methods needed there and the ZeroMQ communication-handling facilities. Incoming requests from remote agents get efficiently processed without pickling and without excessive memory-transfers of large_obj-replica(s), having avoided any form of locking and blocking ( yet both performance-scaling and latency-tuning remain possible ), and the service returns "computed" replies ( right-sized, small and efficient answers ) back to the respective remote-agents, as needed.