I have a data series with a uniform distribution. I wish to exploit the distribution to sort the data in parallel. For N CPUs, I essentially define N buckets and sort the buckets in parallel. My problem is that I do not get a speedup.
What is wrong?
from multiprocessing import Process, Queue
from numpy import array, linspace, arange, where, cumsum, zeros
from numpy.random import rand
from time import time
def my_sort(x, y):
    """Worker: read one bucket array from queue *x* and put its argsort on *y*."""
    y.put(x.get().argsort())


def my_par_sort(X, np):
    """Bucket-sort X in parallel across *np* processes.

    X is split into *np* equal-width value buckets (valid because the data is
    roughly uniform); each bucket is argsorted in a child process and the
    per-bucket orderings are stitched back into global indices.

    Returns an int array of indices of X in ascending order — the same
    contract as X.argsort().
    """
    # Bucket boundaries: np equal-width intervals spanning [X.min(), X.max()].
    bmin = linspace(X.min(), X.max(), np + 1)   # bucket lower bounds
    bmax = array(bmin)
    bmax[-1] = X.max() + 1                      # make the last interval inclusive

    # BUG FIX: the original used ONE shared input queue and ONE shared output
    # queue for all workers.  Workers consume input and finish in arbitrary
    # order, so Yq.get() in iteration i could return the sorted indices of a
    # *different* bucket, silently producing a wrong permutation.  Give each
    # worker its own queue pair so bucket i always maps to result i.
    in_qs = [Queue() for _ in range(np)]
    out_qs = [Queue() for _ in range(np)]

    procs = []
    buckets = []   # global indices belonging to each bucket
    sizes = [0]    # per-bucket sizes, later cumsum'd into output offsets
    for i in range(np):
        mask = array([bmin[i] <= X, X < bmax[i + 1]]).all(0)
        buckets.append(where(mask)[0])
        sizes.append(len(buckets[-1]))
        in_qs[i].put(X[mask])
        p = Process(target=my_sort, args=(in_qs[i], out_qs[i]))
        p.start()
        procs.append(p)

    offsets = cumsum(sizes).tolist()
    # dtype=int: Y holds indices; the original float zeros() silently
    # produced float-valued indices.
    Y = zeros(len(X), dtype=int)
    for i in range(np):
        # get() BEFORE join(): a worker blocked putting a large result on its
        # queue would deadlock if we joined first.
        order = out_qs[i].get()
        Y[arange(offsets[i], offsets[i + 1])] = buckets[i][order]
        procs[i].join()
    return Y
if __name__ == '__main__':
    # num_el must be an int: rand(1e7) raises TypeError on modern numpy.
    num_el = int(1e7)
    mydata = rand(num_el)
    np = 4  # could use multiprocessing.cpu_count()

    # Time the parallel bucket sort.
    starttime = time()
    I = my_par_sort(mydata, np)
    # print() calls (single argument) work on both Python 2 and 3;
    # the original Python-2 print statements are a syntax error on 3.
    print("Sorting %0.0e keys took %0.1fs using %0.0f processes"
          % (len(mydata), time() - starttime, np))

    # Time the serial baseline for comparison.
    starttime = time()
    I2 = mydata.argsort()
    print("in serial it takes %0.1fs" % (time() - starttime))

    # Verify the parallel result matches the serial argsort.
    print((I == I2).all())