
I have a function which I will run using multiprocessing. However, the function returns a value, and I do not know how to store that value once it's done.

I read somewhere online about using a queue but I don't know how to implement it or if that'd even work.

import os
from multiprocessing import Process

cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))
for core in cores:
    core.start()
for core in cores:
    core.join()

The function processImages returns a value. How do I save that returned value?

spchee

3 Answers


In your code fragment you have an input dataSets, which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.

cpu_count == dataset length?

The first problem I notice is that os.cpu_count() drives the range of values of i, which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 to 1000 (or more...) cores.

An aside about CPU-bound work

I'm also going to assume that you have already determined that the task really is CPU-bound, so it makes sense to split by core. If, instead, your task is disk I/O-bound, you would want more workers. You could also be memory-bound or cache-bound. If optimal parallelization is important to you, you should run some trials to see which number of workers really gives you maximum performance.
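If you do run such a trial, a minimal sketch could look like this (it assumes the processImages function and dataSets list from your question, the candidate worker counts are arbitrary, and on platforms that spawn rather than fork you would run it under an if __name__ == "__main__": guard):

import os
import time
from multiprocessing import Pool

# Time a few pool sizes to see which is fastest for this workload;
# the candidate counts tried here are arbitrary.
for worker_count in (1, 2, os.cpu_count(), 2 * os.cpu_count()):
    start = time.perf_counter()
    with Pool(worker_count) as pool:
        pool.map(processImages, dataSets)
    elapsed = time.perf_counter() - start
    print(f"{worker_count} workers: {elapsed:.2f}s")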


Pool class

Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).

TLDR

Use those simple multiprocessing concepts like this:

from multiprocessing import Pool
import os

# Renaming this variable just for clarity of the example here
work_queue = datasets

# This is the number you might want to find experimentally,
# or just run with os.cpu_count()
worker_count = os.cpu_count()

# The Pool creates and manages the worker processes for you
# behind the scenes
worker_pool = Pool(worker_count)

# Farm out the work, gather the results; map() does not care
# whether the dataset count equals the cpu count
processed_work = worker_pool.map(processImages, work_queue)
worker_pool.close()
worker_pool.join()

# Do something with the result
print(processed_work)
Bill Huneke

If you want to use the result object returned by a multiprocessing worker, try this:

from multiprocessing.pool import ThreadPool


def fun(*fun_arguments):
    # ... do the work here ...
    return object_1, object_2


pool = ThreadPool(processes=number_of_your_process)
async_result = pool.apply_async(fun, fun_arguments)
object_1, object_2 = async_result.get()

Then you can do whatever you want with the returned objects.
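As the comments below point out, ThreadPool only pays off for I/O-bound work. For CPU-bound work, the same apply_async / get pattern works with the process-based multiprocessing.Pool; here is a minimal self-contained sketch (square is just a made-up stand-in function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        async_result = pool.apply_async(square, (10,))
        print(async_result.get())  # prints 100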

Sean.H
  • @lenik I thought that if what spchee wants is to get and store the result(s) returned from a function, then `multiprocessing.pool.ThreadPool` may not be a bad way to go. – Sean.H Nov 07 '19 at 03:30
  • If his tasks are CPU-bound, multithreading will just slow things down instead of running them in parallel like multiprocessing does, which kind of defeats the purpose. – lenik Nov 07 '19 at 03:35
  • I agree with you. `multiprocessing.pool.ThreadPool` is only good for I/O-bound jobs. – Sean.H Nov 07 '19 at 03:38

You cannot return a value directly from another process. The recommended way is to create a queue (multiprocessing.Queue), have your subprocess put its results onto that queue, and read them back once it's done. This works well if you have a lot of results.
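Here is a self-contained sketch of that queue pattern, adapted to the structure in the question (sum stands in for the real processImages work):

from multiprocessing import Process, Queue

def worker(q, data):
    # put the result onto the shared queue instead of returning it
    q.put(sum(data))

if __name__ == "__main__":
    data_sets = [[1, 2], [3, 4], [5, 6]]
    q = Queue()
    procs = [Process(target=worker, args=(q, d)) for d in data_sets]
    for p in procs:
        p.start()
    # drain the queue before joining; a child process may not exit
    # while data it has put on the queue is still buffered
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    print(results)  # one result per process, in completion order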

If you just need a single number, using Value or Array could be easier.
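For instance, a worker can write a single number into a shared Value (again a minimal sketch, with 42 standing in for a real computation):

from multiprocessing import Process, Value

def worker(result):
    # store the computed number in the shared memory slot
    result.value = 42

if __name__ == "__main__":
    result = Value("i", 0)  # "i" means a C signed int
    p = Process(target=worker, args=(result,))
    p.start()
    p.join()
    print(result.value)  # prints 42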

Just remember that you cannot use a simple variable for this; it has to be wrapped in the above-mentioned classes from the multiprocessing library.

lenik