
I'm trying to implement a function that uses Python multiprocessing in order to speed up a calculation. I'm building a pairwise distance matrix, but the implementation with for loops takes more than 8 hours.

This code seems to run faster, but when I print the matrix it is full of zeros. When I print the rows inside the function, the values look correct. I think it is a scope problem, but I cannot understand how to deal with it.

import multiprocessing
import time
import numpy as np

def MultiProcessedFunc(i,x):
    for j in range(i,len(x)):
        time.sleep(0.08)
        M[i,j] = (x[i]+x[j])/2
    print(M[i,:]) # Check if the operation works
    print('')

processes = []

v = [x+1 for x in range(8000)]
M = np.zeros((len(v),len(v)))

start = time.time()
for i in range(len(v)):
    p = multiprocessing.Process(target=MultiProcessedFunc, args=(i, v))
    processes.append(p)
    p.start()

for process in processes:
    process.join()
end = time.time()

print('Multiprocessing: {}'.format(end-start))
print(M)

Giuseppe Minardi
  • Note that if what you are after is just a pair-wise euclidean distance, https://stackoverflow.com/a/47775357/42610 will give you much faster code. – liori Feb 14 '19 at 20:46
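
As the comment above suggests, when the goal really is a pairwise euclidean distance, the whole matrix can be vectorized in numpy with broadcasting and no multiprocessing at all. A minimal sketch for 1-D data (the variable names and sizes here are just illustrative):

```python
import numpy as np

v = np.arange(1, 101, dtype=float)   # 1-D sample data
# Broadcasting builds the full pairwise |v[i] - v[j]| matrix in one shot
D = np.abs(v[:, None] - v[None, :])
print(D[0, :4])                      # distances from the first point
```

For higher-dimensional points, the linked answer (and `scipy.spatial.distance.pdist`) follow the same vectorized idea.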

1 Answer


Unfortunately your code won't work written that way. Multiprocessing spawns separate processes, which means their memory spaces are separate! Changes made by one subprocess will not be reflected in the other processes or in your parent process.

Strictly speaking, this is not a scoping issue. Scope is something defined within a single interpreter process.

The module does provide ways of sharing memory between processes, but this comes at a cost (shared memory is much slower, due to locking issues and such).

Now, numpy has a nice feature: it releases the GIL during computation. This means that using multithreading instead of multiprocessing should give you some benefit with little change to your code: simply replace `import multiprocessing` with `import threading` and `multiprocessing.Process` with `threading.Thread`. The code then produces the correct result. On my machine, after removing the print statements and the sleep call, it runs in under 8 seconds:

Multiprocessing: 7.48570203781
[[1.000e+00 1.000e+00 2.000e+00 ... 3.999e+03 4.000e+03 4.000e+03]
 [0.000e+00 2.000e+00 2.000e+00 ... 4.000e+03 4.000e+03 4.001e+03]
 [0.000e+00 0.000e+00 3.000e+00 ... 4.000e+03 4.001e+03 4.001e+03]
 ...
 [0.000e+00 0.000e+00 0.000e+00 ... 7.998e+03 7.998e+03 7.999e+03]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 7.999e+03 7.999e+03]
 [0.000e+00 0.000e+00 0.000e+00 ... 0.000e+00 0.000e+00 8.000e+03]]
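
Concretely, the threaded variant could look like the sketch below (passing `M` as an argument rather than relying on a global, and with a reduced size for illustration):

```python
import threading
import numpy as np

def func(i, x, M):
    # Threads share the parent's memory, so writes to M are visible everywhere
    for j in range(i, len(x)):
        M[i, j] = (x[i] + x[j]) / 2

v = [x + 1 for x in range(200)]
M = np.zeros((len(v), len(v)))
threads = [threading.Thread(target=func, args=(i, v, M)) for i in range(len(v))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(M[0, :4])
```

Because all threads live in the same process, `M` is genuinely shared, and each thread fills a disjoint row, so no locking is required.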

An alternative is to have your subprocesses return the result and then combine the results in your main process.

Bakuriu
  • Definitely could use a shared memory array from `multiprocessing`, but this is a classic case for rewrite as a `numpy` array (as you indicate). – Mike McKerns Feb 14 '19 at 20:16
  • @MikeMcKerns Yes, the multiprocessing array provides no vector operations, which means it would be *way* slower. – Bakuriu Feb 14 '19 at 20:20
  • The problem is that in the original program I do not have a list of numbers but a list of strings, and I cannot use numpy for it – Giuseppe Minardi Feb 14 '19 at 20:22
  • Also, the sleep part is meant to emulate what happens in the real code – Giuseppe Minardi Feb 14 '19 at 21:02
  • @GiuseppeMinardi Using numbers vs. strings almost always requires different approaches. Without the details, we cannot give you better advice than what I wrote: if you need shared memory, you have to use the `multiprocessing` API; the alternative is to use `multiprocessing.Pool` methods to return the results and combine them afterwards. Which one works better depends on what you are going to do, and you will probably have to profile the two solutions to see which is better in your case. – Bakuriu Feb 14 '19 at 21:32
  • Yeah sorry, I'm not a computer scientist and I'm relatively new to stack overflow, I will keep it in mind – Giuseppe Minardi Feb 14 '19 at 21:39