I came across the same question and wondered about the performance of the different ways of sorting one array and reordering another accordingly.
Performance comparison two array case
I think the list of solutions mentioned here is comprehensive but I was wondering also about the performance. Thus I implemented all algorithms and did a performance comparison.
Sorting using zip twice
def zip_sort(s, p):
ordered_s, ordered_p = zip(*sorted(list(zip(s, p))))
return np.array(ordered_s, dtype=s.dtype), np.array(ordered_p, dtype=p.dtype)
Sorting using argsort. This will not consider the other array for auxiliary sorting
def argsort(s, p):
indexes = s.argsort()
return s[indexes], p[indexes]
Sorting using numpy recarrays
def recarray_sort(s, p):
rec = np.rec.fromarrays([s, p])
rec.sort()
return rec.f0, rec.f1
Sorting using numpy lexsort
def lexsort(s, p):
indexes = np.lexsort([p, s])
return s[indexes], p[indexes]
Sorting two lists p and q of 100000 random integers will yield the following performance
zip_sort
258 ms ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
argsort
9.67 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
recarray_sort
86.4 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
lexsort
12.4 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hence argsort is fastest but will also yield slightly different results than the other algorithms. In case auxiliary sorting is not needed argsort should be used.
Performance comparison multi array case
Next, one might need to do such sorting for multiple arrays. Modifying the algorithms to handle multiple arrays looks like
Sorting using zip twice
def zip_sort(*arrays):
ordered_lists = zip(*sorted(list(zip(*arrays))))
return tuple(
(np.array(l, dtype=arrays[i].dtype) for i, l in enumerate(ordered_lists))
)
Sorting using argsort. This will not consider the other arrays for auxiliary sorting
def argsort(*arrays):
indexes = arrays[0].argsort()
return tuple((a[indexes] for a in arrays))
Sorting using numpy recarrays
def recarray_sort(*arrays):
rec = np.rec.fromarrays(arrays)
rec.sort()
return tuple((getattr(rec, field) for field in rec.dtype.names))
Sorting using numpy lexsort
def lexsort(*arrays):
indexes = np.lexsort(arrays[::-1])
return tuple((a[indexes] for a in arrays))
Sorting a list of 100 arrays with each 100000 random integers (arrays = [np.random.randint(10, size=100000) for _ in range (100)]
) yields now the following performance
zip_sort
13.9 s ± 570 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
argsort
49.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
recarray_sort
491 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
lexsort
881 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
argsort remains fastest which seems logical due to ignoring auxiliary sorting. For the other algorithms, those with auxiliary column sorting, the recarray based solution now out beats the lexsort variant.
Disclaimer: Results may vary for other dtypes and also depend on the randomness of the array data. I used 42 as seed.