I have a structured array with several fields, and I want to sort it by two of them. One of these fields is binary, e.g.:
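import numpy as np

size = 100000
# Structured array with a binary 'class' field and an integer 'value' field
data = np.empty(
    shape=2 * size,
    dtype=[('class', int),
           ('value', int)],
)
# First half: class 0, values drawn around 0
data['class'][:size] = 0
data['value'][:size] = (np.random.normal(size=size) * 10).astype(int)
# Second half: class 1, values drawn with loc=0.5 (before scaling by 10)
data['class'][size:] = 1
data['value'][size:] = (np.random.normal(size=size, loc=0.5) * 10).astype(int)
np.random.shuffle(data)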
I need the result to be sorted by value, and among equal values the rows with class=0 should go first. Doing it like so (a):
idx = np.argsort(data, order=['value', 'class'])
data_sorted = data[idx]
seems to be an order of magnitude slower than sorting just data['value']. Is there a way to improve the speed, given that there are only two classes?
By experimenting randomly I noticed that an approach like this (b):
idx = np.argsort(data['value'])
data_sorted = data[idx]
idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
data_sorted = data_sorted[idx]
takes ~20% less time than (a). Changing the field data types also seems to have some effect: floats instead of ints seem to be slightly faster.
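For reference, here is a minimal sketch of how the comparison between (a), (b) and the value-only sort could be reproduced. The function names (make_data, sort_a, sort_b, sort_value_only), the fixed seed and the smaller size are assumptions made for illustration, not part of the code above; timings will vary by machine and NumPy version, so no numbers are given.

```python
import timeit
import numpy as np

def make_data(size, seed=0):
    # Same layout as the example data, but seeded and parameterised (assumption).
    rng = np.random.RandomState(seed)
    data = np.empty(2 * size, dtype=[('class', int), ('value', int)])
    data['class'][:size] = 0
    data['value'][:size] = (rng.normal(size=size) * 10).astype(int)
    data['class'][size:] = 1
    data['value'][size:] = (rng.normal(size=size, loc=0.5) * 10).astype(int)
    # Shuffle rows by indexing with a random permutation.
    return data[rng.permutation(data.shape[0])]

def sort_value_only(data):
    # Baseline: sort on 'value' alone, ignoring 'class'.
    return data[np.argsort(data['value'])]

def sort_a(data):
    # Approach (a): one structured argsort on both fields.
    idx = np.argsort(data, order=['value', 'class'])
    return data[idx]

def sort_b(data):
    # Approach (b): pre-sort by 'value', then a mergesort on both fields.
    idx = np.argsort(data['value'])
    data_sorted = data[idx]
    idx = np.argsort(data_sorted, order=['value', 'class'], kind='mergesort')
    return data_sorted[idx]

data = make_data(20000)

# Both approaches should give the same ordering of the two fields.
res_a, res_b = sort_a(data), sort_b(data)
assert np.array_equal(res_a['value'], res_b['value'])
assert np.array_equal(res_a['class'], res_b['class'])

for fn in (sort_value_only, sort_a, sort_b):
    # Best of three single runs, to reduce noise.
    best = min(timeit.repeat(lambda: fn(data), number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```

Checking that (a) and (b) agree before timing them guards against comparing the speed of two sorts that do not actually produce the same result.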