
I have a situation in multiprocessing where the list I use to collect the results from my function is not getting updated by the process. I have two examples of code: one which updates the list (correction: it updates properly with `Thread`, but fails with `Process`), and one which does not. I cannot detect any kind of error. I think this might be a subtlety of scope that I don't understand.


Here is the first example (correction: this one does not work either when using `Process`; it does work with `threading.Thread`, however):

from multiprocessing import Process

# knn_result, dataset, metric, k_lower, k_upper and batch_size are defined elsewhere.
def run_knn_result_wrapper(dataset, k_value, metric, results_list, index):
    # Store the result for one k value in its slot of the shared list.
    results_list[index] = knn_result(dataset, k_value, metric)

results = [None] * (k_upper - k_lower)
threads = [None] * (k_upper - k_lower)
joined = [0] * (k_upper - k_lower)

for i in range(len(threads)):
    threads[i] = Process(target=run_knn_result_wrapper,
                         args=(dataset, k_lower + i, metric, results, i))
    threads[i].start()
    if batch_size == 1:
        threads[i].join()
        joined[i] = 1
    else:
        # Once a batch of workers has been launched, join the recent ones.
        if i % batch_size == batch_size - 1 and i > 0:
            for j in range(max(0, i - 2), i):
                if joined[j] == 0:
                    threads[j].join()
                    joined[j] = 1

# Join any workers that are still outstanding.
for i in range(len(threads)):
    if joined[i] == 0:
        threads[i].join()


Ignoring the "threads" variable name (this started with threading, but then I found out about the GIL), the `results` list updates perfectly when these are threads; with `Process`, as noted in the correction above, it does not.

Here is the code which does not update the results list:

from multiprocessing import Process
import numpy as np

# prediction_on_batch and X are defined elsewhere.
def prediction_on_batch_wrapper(batchX, results_list, index):
    # Store this batch's predictions in its slot of the results list.
    results_list[index] = prediction_on_batch(batchX)

batches_of_X = np.array_split(X, 10)

overall_predicted_classes_list = []
for i in range(len(batches_of_X)):
    batches_of_X_subsets = np.array_split(batches_of_X[i], 10)
    processes = [None] * len(batches_of_X_subsets)
    results_list = [None] * len(batches_of_X_subsets)
    for j in range(len(batches_of_X_subsets)):
        processes[j] = Process(target=prediction_on_batch_wrapper,
                               args=(batches_of_X_subsets[j], results_list, j))
    for j in processes:
        j.start()
    for j in processes:
        j.join()
    if len(results_list) > 1:
        results_array = np.concatenate(tuple(results_list))
    else:
        results_array = results_list[0]

I cannot tell why, within Python's scoping rules, the results_list does not get updated by the prediction_on_batch_wrapper function.

A debugging session reveals that the results_list value inside the prediction_on_batch_wrapper function does, in fact, get updated... but somehow its scope appears to be local in this second Python file, and global in the first...


What is going on here?

Chris
  • Do you understand the difference between a thread and a process, and why they use ``manager = Manager()`` and ``return_dict = manager.dict()`` in the answer to the question you mentioned? : http://stackoverflow.com/questions/10415028/how-can-i-recover-the-return-value-of-a-function-passed-to-multiprocessing-proce – minhhn2910 Apr 02 '16 at 17:11
  • Nope, not at all :) But I think that now I do. – Chris Apr 02 '16 at 17:14
  • then try ``manager = Manager()`` and ``results_list = manager.list([None]*len(batches_of_X_subsets))`` for the second snippet if you still want to use process :D – minhhn2910 Apr 02 '16 at 17:18
  • Thank you. I will try that... – Chris Apr 02 '16 at 17:20
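
For reference, a minimal sketch of the Manager-based approach suggested in the comments above, applied to the structure of the second snippet. prediction_on_batch is not shown in the question, so it is stubbed out here, and the data shapes are invented for illustration:

from multiprocessing import Manager, Process
import numpy as np

def prediction_on_batch(batchX):
    # Stand-in for the question's function: one prediction per row.
    return np.zeros(len(batchX))

def prediction_on_batch_wrapper(batchX, results_list, index):
    results_list[index] = prediction_on_batch(batchX)

if __name__ == '__main__':
    X = np.arange(1000, dtype=float).reshape(100, 10)
    batches_of_X_subsets = np.array_split(X, 10)

    # A manager-backed list lives in a separate server process, so writes made
    # by the worker processes are visible back in the parent.
    manager = Manager()
    results_list = manager.list([None] * len(batches_of_X_subsets))

    processes = [Process(target=prediction_on_batch_wrapper,
                         args=(batches_of_X_subsets[j], results_list, j))
                 for j in range(len(batches_of_X_subsets))]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

    results_array = np.concatenate(tuple(results_list))
    print(results_array.shape)  # (100,)

The `if __name__ == '__main__'` guard matters when spawning processes; it is required on Windows, where each child re-imports the main module.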

1 Answer


This is because you are spawning another process - separate processes do not share any resources, and that includes memory.

Each process is a separate, isolated running program, usually visible in Task Manager or ps. When you use Process to spawn an additional process, you should see a second instance of Python start.

A thread is another execution point within your main process, and shares all of the resources of the main process even across multiple cores. All threads within a process are capable of seeing any part of the overall process, although how much they can use depends on the code that you write for the thread and the restrictions of the language in which you write them.

Using Process is like running two instances of your program; you can pass parameters to the new process, but those are copies that are no longer shared once they are passed. For example, if you modified the data within the main process, the new process wouldn't see the changes, since the two processes have completely separate copies of the data.
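
A minimal sketch of that difference (not the question's code): the same writer function leaves the parent's list untouched when run in a Process, but updates it when run in a Thread.

from multiprocessing import Process
from threading import Thread

def set_item(results, index):
    results[index] = 42

if __name__ == '__main__':
    results = [None]

    p = Process(target=set_item, args=(results, 0))
    p.start()
    p.join()
    print(results)  # [None] -- the child process wrote to its own copy

    t = Thread(target=set_item, args=(results, 0))
    t.start()
    t.join()
    print(results)  # [42] -- the thread shares the parent's memory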

If you want to share data, you should really use threads rather than processes. For most multi-processing needs, threads are preferable to processes, except in the few cases where you need the strict separation.
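
For instance, a threaded version of the question's second snippet keeps the in-place list updates working, since every thread sees the same results_list (prediction_on_batch is stubbed out here because it is not shown in the question):

from threading import Thread
import numpy as np

def prediction_on_batch(batchX):
    # Stand-in for the question's function: one prediction per row.
    return np.zeros(len(batchX))

def prediction_on_batch_wrapper(batchX, results_list, index):
    results_list[index] = prediction_on_batch(batchX)

X = np.arange(1000, dtype=float).reshape(100, 10)
batches_of_X_subsets = np.array_split(X, 10)
results_list = [None] * len(batches_of_X_subsets)

threads = [Thread(target=prediction_on_batch_wrapper,
                  args=(batches_of_X_subsets[j], results_list, j))
           for j in range(len(batches_of_X_subsets))]
for t in threads:
    t.start()
for t in threads:
    t.join()

results_array = np.concatenate(tuple(results_list))  # every slot was filled in place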

Matt Jordan
  • What if I allocated memory and gave the process a coordinate in ram to output the return object to? Is that possible with Python? – Chris Apr 02 '16 at 17:14
  • No, because the coordinates wouldn't be accessible by another process. You could use a named pipe or similar, but that's going to complicate your code by quite a bit. Why do you want to use a Process rather than a Thread? – Matt Jordan Apr 02 '16 at 17:15
  • The process example link you provide is consistent with the information - that example never reads a value back from each spawned process, instead it passes data to each spawned process, the spawned process prints it, and then the main program records that the process completed, without retrieving any additional data (for example, no updated value of i). – Matt Jordan Apr 02 '16 at 17:18
  • I heard that "GIL" makes threading slower in Python. If that isn't true, then I have no reason. – Chris Apr 02 '16 at 18:12
  • @bordeo: The GIL (global interpreter lock) means two threads can't run Python bytecode at once, which means threading is mostly useful only for I/O related work (and occasionally using third party extension modules that explicitly release the GIL). – ShadowRanger Apr 02 '16 at 18:23
  • The effect of GIL depends on how much interpreted code you run vs. how much of your code is calling into the underlying library. So, yes, if you are doing a lot of very basic number crunching, you won't see much benefit in multi-processing. Native code (most Python library code) won't be affected by GIL, including most set/list/dictionary operations, since they are implemented in native code. Note that the cost of transferring the results between processes is going to be very high, so if you decide to use processes due to GIL, you need to test performance carefully. – Matt Jordan Apr 02 '16 at 18:26
  • @MattJordan I try to keep everything of any account under the hood. However, I ultimately have to pass a function written in Python to the threads function. The only operations I do in "python" are setting variables to the results of numpy and scipy operations. Finally, I pass the results back to the calling code by setting the value of a list. Am I going to see a lot of loss due to GIL in that case? – Chris Apr 02 '16 at 21:49
  • I believe that NumPy is mostly written in C, so I doubt you will see much of a loss from GIL for that; I don't know about SciPy, but I would guess the same. I suggest trying it with threading first and check CPU utilization, because 1) the multiprocessing part of your code won't change much between using threads and processes, and 2) if you have to change to processes because of GIL, you will be adding inter-process communication, without changing the processing code. – Matt Jordan Apr 02 '16 at 22:26