Performance of sorting structured arrays (numpy)

Question

I have an array with several fields, which I want to be sorted with respect to 2 of them. One of these fields is binary, e.g.:

import numpy as np

size = 100000
data = np.empty(
    shape=2 * size,
    dtype=[('class', int), ('value', int)],
)

data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)

np.random.shuffle(data)

I need the result to be sorted with respect to value, and for same values class=0 should go first. Doing it like so (a):

idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]

seems to be an order of magnitude slower compared to sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?

By experimenting randomly I noticed that an approach like this (b):

idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]

takes ~20% less time than (a). Changing the field datatypes also seems to have some effect: floats instead of ints seem to be slightly faster.

Is there a reason you aren't using `sort(data, order=['value', 'class'])`? It's clearer (and about 10% faster). — user2699, Apr 03 '19 at 13:02

user2699 · Accepted Answer · 2019-04-03T14:09:54.780

The simplest way to do this is to use the order parameter of np.sort:

np.sort(data, order=['value', 'class'])

However, this takes 121 ms to run on my computer, while sorting data['class'] and data['value'] on their own takes only 2.44 and 5.06 ms respectively. Interestingly, np.sort(data, order='class') also takes 135 ms, suggesting the problem lies in sorting structured arrays rather than in the number of fields.
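The comparison above can be reproduced with a small timing harness. This is only a sketch: the helper name best_ms, the fixed seed, and the repeat count are assumptions made for illustration, and the absolute numbers will differ by machine and NumPy version.

```python
import timeit

import numpy as np

# Build data similar to the question's (seed chosen arbitrarily).
rng = np.random.default_rng(0)
size = 100_000
data = np.empty(2 * size, dtype=[('class', int), ('value', int)])
data['class'] = np.repeat([0, 1], size)
data['value'] = (rng.normal(size=2 * size) * 10).astype(int)
data = data[rng.permutation(data.size)]  # shuffle the records


def best_ms(fn, repeat=3):
    """Best wall-clock time of fn() over `repeat` runs, in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3


timings = {
    "structured, order=['value', 'class']":
        best_ms(lambda: np.sort(data, order=['value', 'class'])),
    "plain field data['class']": best_ms(lambda: np.sort(data['class'])),
    "plain field data['value']": best_ms(lambda: np.sort(data['value'])),
}

for label, ms in timings.items():
    print(f"{label}: {ms:.2f} ms")
```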

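So, the approach you've taken of sorting each field using argsort and then indexing the final array seems to be on the right track. However, you need to sort each field individually: first by the secondary key ('class'), then by the primary key ('value') with a stable sort, so that the class order is preserved among equal values,

idx = np.argsort(data['class'])
data_sorted = data[idx][np.argsort(data['value'][idx], kind='stable')]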

This runs in 43.9 ms. You can get a very slight speedup by removing one temporary array from the indexing:

idx = np.argsort(data['class'])
tmp = data[idx]
data_sorted = tmp[np.argsort(tmp['value'], kind='stable')]

This runs in 40.8 ms. Not great, but it is a workaround if performance is critical.
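As a sanity check, the two-pass approach can be compared against the structured sort on a small random array. This is a minimal sketch; the function name two_pass_sort, the array size, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
data = np.empty(n, dtype=[('class', int), ('value', int)])
data['class'] = rng.integers(0, 2, size=n)
data['value'] = (rng.normal(size=n) * 10).astype(int)


def two_pass_sort(arr):
    """Sort by 'value', breaking ties by 'class', without order=."""
    # Pass 1: order by the secondary key.
    tmp = arr[np.argsort(arr['class'])]
    # Pass 2: stable sort by the primary key keeps the class order
    # among records that share the same value.
    return tmp[np.argsort(tmp['value'], kind='stable')]


result = two_pass_sort(data)
expected = np.sort(data, order=['value', 'class'])

matches = bool(
    np.array_equal(result['value'], expected['value'])
    and np.array_equal(result['class'], expected['class'])
)
print(matches)
```

The second pass has to be stable; the first one does not, because with only these two fields, records that tie on both keys are identical.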

This seems to be a known problem; see the question "sorting numpy structured and record arrays is very slow".

Edit: The source code for the comparisons used in sort can be seen at https://github.com/numpy/numpy/blob/dea85807c258ded3f75528cce2a444468de93bc1/numpy/core/src/multiarray/arraytypes.c.src . The comparisons for numeric types are much, much simpler. Still, a performance difference that large is surprising.

`np.lexsort` is also quite a bit slower than `np.sort` (on an unstructured array). — hpaulj, Apr 03 '19 at 16:18
Didn't realize there's this concept of stability, thanks! Indeed this approach boosts sorting of the structured arrays significantly. — SiLiKhon, Apr 04 '19 at 07:56
From my understanding of the Numpy code, it seems `sort(data, order=['value', 'class'])` uses a generic comparator that is a CPython function rather than a native function like the one used for basic integers. This is certainly the main reason why the code is much slower. It turns out that the code also indirectly performs dict accesses in this case. This approach is clearly inefficient, but it is not easy to write fast code since the sorted fields can be of any type... Thank you for pointing this out. I will try to optimize the Numpy code since this use case is quite frequent. — Jérôme Richard, Mar 16 '22 at 01:57

score 1 · Answer 2 · answered Mar 16 '22 at 01:48

In addition to the good (general-purpose) answer of @user2699, in your specific case you can cheat, because the two fields of the structured array are of the same integer type and the values are relatively small (they fit in 32 bits). The cheat consists of the following steps:

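1. Subtract the minimum of each field from all items of that field (to make them non-negative) using arr - np.min(arr).
2. Convert each field to np.uint64 with arr.astype(np.uint64).
3. Pack the bits of the two fields into one integer array using (class_arr << 32) | value_arr. The field placed in the high bits acts as the primary sort key, so for the ordering asked for in the question (value first, then class) the two fields swap places.
4. Sort the resulting array using np.sort.
5. Unpack the array using class_arr = sorted_arr >> 32 and value_arr = sorted_arr & ((1<<32)-1), then add back the minima subtracted in step 1.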
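The steps above can be sketched as follows. This is an illustration under stated assumptions, not a drop-in replacement: the function name packed_sort is hypothetical, the array is assumed to be non-empty with exactly the two integer fields 'class' and 'value', and each shifted field must fit in 32 bits. The sketch puts value in the high bits, because the high-bit field is the primary key and the question asks for value first; the formula in the steps has class in the high bits, which would order by class first.

```python
import numpy as np


def packed_sort(data):
    """Sort by 'value', then 'class', via one packed uint64 key per record."""
    value, cls = data['value'], data['class']
    v_min, c_min = value.min(), cls.min()

    # Steps 1-2: shift each field to start at zero, then widen to uint64.
    v = (value - v_min).astype(np.uint64)
    c = (cls - c_min).astype(np.uint64)
    if int(v.max()) >= 2**32 or int(c.max()) >= 2**32:
        raise ValueError("shifted fields must each fit in 32 bits")

    # Step 3: primary key (value) in the high 32 bits, class in the low 32.
    keys = (v << np.uint64(32)) | c

    # Step 4: a single sort on a plain integer array.
    keys.sort()

    # Step 5: unpack and undo the shift. Only these two fields are rebuilt.
    out = np.empty(data.shape, dtype=data.dtype)
    out['value'] = (keys >> np.uint64(32)).astype(value.dtype) + v_min
    out['class'] = (keys & np.uint64(0xFFFFFFFF)).astype(cls.dtype) + c_min
    return out


rng = np.random.default_rng(0)
n = 10_000
data = np.empty(n, dtype=[('class', int), ('value', int)])
data['class'] = rng.integers(0, 2, size=n)
data['value'] = (rng.normal(size=n) * 10).astype(int)

result = packed_sort(data)
expected = np.sort(data, order=['value', 'class'])
```

Because the records are rebuilt from the keys, any additional fields in the original array would be lost; in that case an argsort-based approach is still needed.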

This strategy is significantly faster than using two np.argsort calls, which are pretty expensive. This is especially true for bigger arrays, since sorting big arrays is even more expensive and np.sort is cheaper than np.argsort. Moreover, indirect indexing is relatively slow on big arrays because of the unpredictable, pseudo-random memory access pattern and the high latency of RAM. The downside of this approach is that it is a bit trickier to implement and it does not apply in all cases.
