11

I often need to sort large numpy arrays (few billion elements), which became a bottleneck of my code. I am looking for a way to parallelize it.

Are there any parallel implementations for the ndarray.sort() function? Numexpr module provides parallel implementation for most math operations on numpy arrays, but lacks sorting capabilities.

Maybe, it is possible to make a simple wrapper around a C++ implementation of parallel sorting, and use it through Cython?

Maxim Imakaev
  • 1,435
  • 2
  • 13
  • 26
  • You could take a look at Theano (http://deeplearning.net/software/theano/index.html). I'm not sure if its sort() function is parallel - but it is compiled and runs on a GPU. – Dietrich Dec 20 '14 at 14:59
  • You may find https://github.com/Quansight/pnumpy useful. – Mark Horvath Oct 04 '22 at 22:55

2 Answers2

10

I ended up wrapping GCC parallel sort. Here is the code:

parallelSort.pyx

# cython: wraparound = False
# cython: boundscheck = False
import numpy as np
cimport numpy as np
import cython
cimport cython 

ctypedef fused real:
    cython.char
    cython.uchar
    cython.short
    cython.ushort
    cython.int
    cython.uint
    cython.long
    cython.ulong
    cython.longlong
    cython.ulonglong
    cython.float
    cython.double

cdef extern from "<parallel/algorithm>" namespace "__gnu_parallel":
    cdef void sort[T](T first, T last) nogil 

def numpyParallelSort(real[:] a):
    "In-place parallel sort for numpy types"
    sort(&a[0], &a[a.shape[0]])

Extra compiler args: -fopenmp (compile) and -lgomp (linking)

This makefile will do it:

all:
    cython --cplus parallelSort.pyx  
    g++  -g -march=native -Ofast -fpic -c    parallelSort.cpp -o parallelSort.o -fopenmp `python-config --includes`
    g++  -g -march=native -Ofast -shared  -o parallelSort.so parallelSort.o `python-config --libs` -lgomp 

clean:
    rm -f parallelSort.cpp *.o *.so

And this shows that it works:

from parallelSort import numpyParallelSort
import numpy as np 
a = np.random.random(100000000)

numpyParallelSort(a) 
print a[:10]

edit: fixed bug noticed in the comment below

Maxim Imakaev
  • 1,435
  • 2
  • 13
  • 26
  • 2
    Great answer, and works well (sorts 3.2 billion floats in under two minutes!!!) However, there is an interesting error. If you look at the end of the list `a[-10:0]` you will see that the last element of the original is not sorted. I had to change `&a[a.shape[0]-1]` to `a[a.shape[0]]` to get correct sorting. – Nelz11 Feb 09 '16 at 19:27
  • To spare someone a bunch of time, it is actually important to turn off bounds checking in this example, since otherwise, it will be unhappy that you are accessing one beyond the end of the array in the call to sort. – Zach Boyd Jan 22 '20 at 14:29
  • I don't think it matters here, since the sort is done in C++ STL, not in cython. There is even no copying data in Cython, just passing pointers. However, In general, that would be important in Cython. – Maxim Imakaev Jan 23 '20 at 15:10
  • I have a ImportError while trying to compile, an idea? thanks ```ImportError: dynamic module does not define module export function (PyInit_parallelSort)``` – antho Aug 29 '20 at 01:43
2

Mergesort parallelizes quite naturally. Just have each worker pre-sort an arbitrary chunk, and then run a single merge pass on it. The final merging should require only O(N) operations, and its trivial to write a function for doing so in numba or somesuch.

Wikipedia agrees

Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42
  • Actually, there is a parallel implementation of stl sort, as described here: https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html – Maxim Imakaev Jan 11 '15 at 05:33