How can I "zip sort" parallel numpy arrays?

Question

If I have two parallel lists and want to sort them by the order of the elements in the first, it's very easy:

>>> a = [2, 3, 1]
>>> b = [4, 6, 7]
>>> a, b = zip(*sorted(zip(a,b)))
>>> print(a)
(1, 2, 3)
>>> print(b)
(7, 4, 6)

How can I do the same using numpy arrays without unpacking them into conventional Python lists?

@YGA, will your input array "a" ever have non-unique values? If so, how would you like the sort to behave in that case? Arbitrary order? Stable sort? Secondary sort using corresponding values in array "b"? — Peter Hansen, Dec 15 '09 at 11:36
This isn't the greatest example since the solution (sorting `a` and `b` together) is the same as if you sort `a` and `b` independently. — Steve, Apr 03 '19 at 00:27
Steve - good point. I've now updated the question and all the answers below so that the two arrays wouldn't have the same independent sort order. — YGA, Apr 04 '19 at 15:49

score 110 · Accepted Answer · edited Apr 04 '19 at 15:47

b[a.argsort()] should do the trick.

Here's how it works. First you need to find a permutation that sorts a. argsort is a method that computes this:

>>> import numpy
>>> a = numpy.array([2, 3, 1])
>>> p = a.argsort()
>>> p
array([2, 0, 1])

You can easily check that this is right:

>>> a[p]
array([1, 2, 3])

Now apply the same permutation to b.

>>> b = numpy.array([4, 6, 7])
>>> b[p]
array([7, 4, 6])
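
The steps above can be wrapped in a small function. This is only a sketch: the helper name `sort_by_first` is hypothetical, and `kind="stable"` is an assumption about what you want when `a` contains repeated values (ties then keep their original relative order; `b` is still not used to break them).

```python
import numpy as np

def sort_by_first(a, b):
    """Reorder a and b together by the values of a (hypothetical helper)."""
    # kind="stable" keeps equal elements of a in their original relative order
    p = np.argsort(a, kind="stable")
    return a[p], b[p]

a = np.array([2, 3, 1])
b = np.array([4, 6, 7])
a_sorted, b_sorted = sort_by_first(a, b)
print(a_sorted)  # [1 2 3]
print(b_sorted)  # [7 4 6]
```

Fancy indexing returns new arrays, so the original `a` and `b` are left untouched.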

edited Apr 04 '19 at 15:47

YGA

answered Dec 14 '09 at 21:19

Jason Orendorff

This doesn't use `b` for "auxiliary sorting", for example when `a` has elements that repeat. Please see my answer for details. – Alok Singhal Dec 15 '09 at 03:35
otoh, auxiliary sorting is not always desired. – tacaswell Oct 14 '13 at 23:03

score 25 · Answer 2 · edited Apr 04 '19 at 15:48

Here's an approach that creates no intermediate Python lists, though it does require a NumPy "record array" for the sorting. If your two input arrays are actually related (like columns in a spreadsheet), keeping them in a single record array rather than two distinct arrays might be an advantageous way to handle your data in general. In that case you'd already have a record array, and your original problem would be solved merely by calling sort() on it.

This does an in-place sort after packing both arrays into a record array:

>>> from numpy import array, rec
>>> a = array([2, 3, 1])
>>> b = array([4, 6, 7])
>>> c = rec.fromarrays([a, b])
>>> c.sort()
>>> c.f1   # fromarrays adds field names beginning with f0 automatically
array([7, 4, 6])

Edited to use rec.fromarrays() for simplicity, skip redundant dtype, use default sort key, use default field names instead of specifying (based on this example).
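
If the default `f0`/`f1` field names feel opaque, the same idea can be written with explicit names and an explicit sort key. This is a sketch rather than part of the original answer: the field names `key` and `value` are made up for illustration, while `names=` and `order=` are existing parameters of `rec.fromarrays` and `sort`.

```python
import numpy as np

a = np.array([2, 3, 1])
b = np.array([4, 6, 7])

# names= is optional; without it the fields are called f0, f1, ...
c = np.rec.fromarrays([a, b], names="key,value")
c.sort(order="key")  # sort the records in place by the "key" field

print(c.key)    # [1 2 3]
print(c.value)  # [7 4 6]
```

The record array holds copies of the data, so `a` and `b` themselves are not reordered.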

edited Apr 04 '19 at 15:48

YGA

answered Dec 14 '09 at 22:35

Peter Hansen

Thanks! I really wish I could accept two answers. This one is less simple but more general. I've upvoted it though, as the least I could do :-) – YGA Dec 16 '09 at 17:18
@YGA, was your edit just to avoid possible confusion from the fact a "2" was in both lists and/or to show that f0 is the sort key, so f1 won't necessarily end up sorted? If not, I can't see the reason for it. If so, thanks: nice touch. :-) – Peter Hansen Apr 08 '19 at 15:23
It was because there was a comment above to note that there was a potential source of confusion in that both arrays had the same sort order. – YGA Apr 09 '19 at 22:02

Matthias Fripp · Answer 3 · 2019-04-04T19:58:48.087

4

Like @Peter Hansen's answer, this makes a copy of the arrays before it sorts them. But it is simple, does the main sort in-place, uses the second array for auxiliary sorting, and should be very fast:

a = np.array([2, 3, 1])
b = np.array([4, 6, 2])
# combine, sort and break apart
a, b = np.sort(np.array([a, b]))

Update: The code above doesn't actually work, as pointed out in a comment. Below is some better code. This should be fairly efficient—e.g., it avoids explicitly making extra copies of the arrays. It's hard to say how efficient it will be, because the documentation doesn't give any details on the numpy.lexsort algorithm. But it should work pretty well, since this is exactly the job lexsort was written for.

import numpy as np

a = np.array([5, 3, 1])
b = np.array([4, 6, 7])
# lexsort treats the last key as primary: sort by a, break ties with b
new_order = np.lexsort([b, a])
a = a[new_order]
b = b[new_order]
print(a, b)
# [1 3 5] [7 6 4]
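
The difference between `lexsort` and a plain `argsort` only shows up when the first array has repeated values. A small sketch with made-up data (the arrays below are illustrative, not from the question):

```python
import numpy as np

a = np.array([2, 1, 2, 1])
b = np.array([9, 8, 3, 7])

# Stable argsort: ties in a keep their original relative order
p_stable = np.argsort(a, kind="stable")
print(a[p_stable], b[p_stable])  # [1 1 2 2] [8 7 9 3]

# lexsort: the LAST key (a) is primary, b breaks ties
p_lex = np.lexsort([b, a])
print(a[p_lex], b[p_lex])        # [1 1 2 2] [7 8 3 9]
```

Both results are sorted by `a`; only `lexsort` also orders `b` within each group of equal `a` values.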

edited Apr 04 '19 at 19:58

answered May 11 '18 at 07:52

Matthias Fripp

This doesn't maintain the same order between arrays; it sorts the two arrays independently. – Steve Apr 03 '19 at 00:25
Thanks, looks like I originally assumed `np.sort` worked like `list.sort`, and didn't catch it in testing because the example arrays should sort the same way either separately or lexicographically. I've given a better answer now (which turns out to be a simpler version of [@doug's answer](https://stackoverflow.com/a/1904880/3830997)). – Matthias Fripp Apr 03 '19 at 18:58

score 3 · Answer 4 · answered Apr 09 '21 at 12:05

I came across the same question and wondered about the performance of the different ways of sorting one array and reordering another accordingly.

Performance comparison two array case

I think the solutions mentioned here cover the main approaches, so I implemented each of them and timed it.

Sorting using zip twice

import numpy as np

def zip_sort(s, p):
    ordered_s, ordered_p = zip(*sorted(list(zip(s, p))))
    return np.array(ordered_s, dtype=s.dtype), np.array(ordered_p, dtype=p.dtype)

Sorting using argsort. This will not consider the other array for auxiliary sorting

def argsort(s, p):
    indexes = s.argsort()
    return s[indexes], p[indexes]

Sorting using numpy recarrays

def recarray_sort(s, p):
    rec = np.rec.fromarrays([s, p])
    rec.sort()
    return rec.f0, rec.f1

Sorting using numpy lexsort

def lexsort(s, p):
    indexes = np.lexsort([p, s])
    return s[indexes], p[indexes]

Sorting two arrays s and p of 100000 random integers each yields the following performance:

zip_sort
258 ms ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

argsort
9.67 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

recarray_sort
86.4 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

lexsort
12.4 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Hence argsort is fastest, but its results can differ from those of the other methods: when the first array contains repeated values, it does not use the second array to break ties. If auxiliary sorting is not needed, argsort should be used.
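
The timings above look like IPython `%timeit` output. Outside IPython, a rough equivalent can be put together with the standard `timeit` module. This is a sketch under assumptions: the function names `argsort_pair` and `lexsort_pair`, the value range of the random data, and the repeat counts are all chosen for illustration, and the numbers it prints will not match the ones above.

```python
import timeit
import numpy as np

rng = np.random.default_rng(42)         # seed is an assumption
s = rng.integers(0, 10, size=100_000)   # value range is an assumption
p = rng.integers(0, 10, size=100_000)

def argsort_pair(s, p):
    idx = s.argsort()
    return s[idx], p[idx]

def lexsort_pair(s, p):
    idx = np.lexsort([p, s])
    return s[idx], p[idx]

for fn in (argsort_pair, lexsort_pair):
    # best of 3 repeats, 10 calls per repeat, reported per call
    best = min(timeit.repeat(lambda: fn(s, p), number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Taking the minimum over repeats reduces noise from other processes, whereas `%timeit` reports a mean and standard deviation.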

Performance comparison multi array case

Next, one might need to sort more than two arrays this way. The algorithms can be modified to handle multiple arrays as follows:

Sorting using zip twice

def zip_sort(*arrays):
    ordered_lists = zip(*sorted(list(zip(*arrays))))
    return tuple(
        (np.array(l, dtype=arrays[i].dtype) for i, l in enumerate(ordered_lists))
    )

Sorting using argsort. This will not consider the other arrays for auxiliary sorting

def argsort(*arrays):
    indexes = arrays[0].argsort()
    return tuple((a[indexes] for a in arrays))

Sorting using numpy recarrays

def recarray_sort(*arrays):
    rec = np.rec.fromarrays(arrays)
    rec.sort()
    return tuple((getattr(rec, field) for field in rec.dtype.names))

Sorting using numpy lexsort

def lexsort(*arrays):
    indexes = np.lexsort(arrays[::-1])
    return tuple((a[indexes] for a in arrays))

Sorting a list of 100 arrays of 100000 random integers each (arrays = [np.random.randint(10, size=100000) for _ in range(100)]) now yields the following performance:

zip_sort
13.9 s ± 570 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

argsort
49.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

recarray_sort
491 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

lexsort
881 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

argsort remains fastest, which makes sense because it ignores auxiliary sorting. Among the algorithms that do use the other arrays to break ties, the recarray-based solution now outperforms the lexsort variant.

Disclaimer: Results may vary for other dtypes and also depend on the randomness of the array data. I used 42 as seed.

score 2 · Answer 5 · edited Jan 02 '10 at 21:23

This might be the simplest and most general way to do what you want. (I used three arrays here, but this will work with any number of columns, whether two or two hundred.)

import numpy as NP
fnx = lambda : NP.random.randint(0, 10, 6)
a, b, c = fnx(), fnx(), fnx()
abc = NP.column_stack((a, b, c))
keys = (abc[:,0], abc[:,1])          # sort on 2nd column, resolve ties using 1st col
indices = NP.lexsort(keys)        # create index array
ab_sorted = NP.take(abc, indices, axis=0)

One quirk with lexsort is that you have to specify the keys in reverse order, i.e., put your primary key last and your secondary key first. In my example, I want to sort using the 2nd column as the primary key, so I list it second; the 1st column only resolves ties, but it is listed first.
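
The key order is easier to see with fixed values than with random ones. A minimal sketch, using a small hand-written array (the data is made up for illustration):

```python
import numpy as np

abc = np.array([[3, 1, 7],
                [1, 2, 8],
                [2, 1, 9],
                [0, 2, 5]])

# Primary key: column 1 (listed LAST); tie-breaker: column 0 (listed first)
indices = np.lexsort((abc[:, 0], abc[:, 1]))
abc_sorted = abc[indices]
print(abc_sorted)
# [[2 1 9]
#  [3 1 7]
#  [0 2 5]
#  [1 2 8]]
```

Rows are grouped by the second column (1, 1, 2, 2), and within each group the first column is ascending.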
