9

it looks like sorting numpy structured and record arrays by a single column is much slower than doing a sort on a similar standalone array:

In [111]: a = np.random.rand(1e4)

In [112]: b = np.random.rand(1e4)

In [113]: rec = np.rec.fromarrays([a,b])

In [114]: timeit rec.argsort(order='f0')
100 loops, best of 3: 18.8 ms per loop

In [115]: timeit a.argsort()
1000 loops, best of 3: 891 µs per loop

There is a marginal improvement using the structured array, but it's not dramatic:

In [120]: struct = np.empty(len(a),dtype=[('a','f8'),('b','f8')])

In [121]: struct['a'] = a

In [122]: struct['b'] = b

In [124]: timeit struct.argsort(order='a')
100 loops, best of 3: 15.8 ms per loop

This indicates that it's potentially faster to create an index array from argsort and then use that to reorder the individual arrays. This is OK except that I expect to be dealing with very large arrays and would like to avoid copying data as much as possible. Is there a more efficient way of doing this that I'm missing?

Rok
  • 613
  • 4
  • 17

2 Answers2

4

What´s slowing you is the use of order, not the fact that you have a record array. If you want to sort by a single field, do it like this:

In [12]: %timeit np.argsort(rec['f0'])
1000 loops, best of 3: 829 us per loop

Once order is used, performance goes south no matter how many fields you want to sort by:

In [16]: %timeit np.argsort(rec, order=['f0'])
10 loops, best of 3: 27.9 ms per loop

In [17]: %timeit np.argsort(rec, order=['f0', 'f1'])
10 loops, best of 3: 28.4 ms per loop
Jaime
  • 65,696
  • 17
  • 124
  • 159
  • Aha! I figured order did the np.argsort() under the hood, but I guess not? – Rok Oct 31 '13 at 08:44
  • 1
    but actually, this doesn't solve the issue of copying the data -- it will require me to pass the indices returned by argsort which will result in a copy. – Rok Oct 31 '13 at 10:57
3

As Jaime have said, you can use argsort to sort the record array.

inds = np.argsort(rec['f0'])

And use take to avoid making a copy

np.take(rec, inds, out=rec)
imsc
  • 7,492
  • 7
  • 47
  • 69
  • 2
    The only reason that works is because `np.take` makes a copy when you specify an `out` parameter and leave the `mode` one in its default `'raise'` state, you can look [at the source](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/item_selection.c#L99). If you use another `mode` there will be no copy, but the output will be rubbish, with some values repeated several times and others missing altogether. – Jaime Jul 16 '14 at 19:52