5

I have an array that determines an ordering of elements:

order = [3, 1, 4, 2]

And then I want to sort another, larger array (containing only those elements):

a = np.array([4, 2, 1, 1, 4, 3, 1, 3])    

such that the element(s) that come first in order come first in the results, etc.
In straight Python, I would do this with a key function:

sorted(a, key=order.index)
[3, 3, 1, 1, 1, 4, 4, 2]

How can I do this (efficiently) with numpy? Is there a similar notion of "key function" for numpy arrays?

Doctor J
  • 5,974
  • 5
  • 44
  • 40

3 Answers3

5

Specific case : Ints

For ints, we could use bincount -

np.repeat(order,np.bincount(a)[order])

Sample run -

In [146]: sorted(a, key=order.index)
Out[146]: [3, 3, 1, 1, 1, 4, 4, 2]

In [147]: np.repeat(order,np.bincount(a)[order])
Out[147]: array([3, 3, 1, 1, 1, 4, 4, 2])

Generic case

Approach #1

Generalizing for all dtypes with bincount -

# https://stackoverflow.com/a/41242285/ @Andras Deak
def argsort_unique(idx):
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

sidx = np.argsort(order)
c = np.bincount(np.searchsorted(order,a,sorter=sidx))
out = np.repeat(order, c[argsort_unique(sidx)])

Approach #2-A

With np.unique and searchsorted for the case when all elements from order are in a -

unq, count = np.unique(a, return_counts=True)
out = np.repeat(order, count[np.searchsorted(unq, order)])

Approach #2-B

To cover for all cases, we need one extra step -

unq, count = np.unique(a, return_counts=1)
sidx = np.searchsorted(unq, order)
out = np.repeat(order, np.where(unq[sidx] == order,count[sidx],0))
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Okay, okay, that's crafty. I'd like to do it for any type of value, but I'll give you partial credit. :) That gives me some ideas. – Doctor J Jun 09 '18 at 00:07
  • @DoctorJ Added few generic solutions. – Divakar Jun 09 '18 at 07:21
  • 1
    @YakymPirozhenko Don't you mean `np.argsort(order).argsort()` being equivalent to `np.searchsorted(unq, order)`? That would be same as `@Doctor J's soln`. – Divakar Jun 09 '18 at 07:43
1

Building on @Divakar's solution, you can count how many times each element occurs and then repeat the ordered elements that many times:

c = Counter(a)
np.repeat(order, [c[v] for v in order])

(You could vectorize the count lookup if you like). I like this because it's linear time, even if it's not pure numpy.

I guess a pure numpy equivalent would look like this:

count = np.unique(a, return_counts=True)[1]
np.repeat(order, count[np.argsort(np.argsort(order))])

But that's less direct, more code, and way too many sorts. :)

Doctor J
  • 5,974
  • 5
  • 44
  • 40
0

This is a fairly direct conversion of your pure-Python approach into numpy. The key idea is replacing the order.index function with a lookup in a sorted vector. Not sure if this is any simpler or faster than the solution you came up with, but it may generalize to some other cases.

import numpy as np
order = np.array([3, 1, 4, 2])
a = np.array([4, 2, 1, 1, 4, 3, 1, 3])  

# create sorted lookup vectors
ord = np.argsort(order)
order_sorted = order[ord]
indices_sorted = np.arange(len(order))[ord]

# lookup the index in `order` for each value in the `a` vector
a_indices = np.interp(a, order_sorted, indices_sorted).astype(int)

# sort `a` using the retrieved index values
a_sorted = a[np.argsort(a_indices)]
a_sorted

# array([3, 3, 1, 1, 1, 4, 4, 2])

This is a more direct way (based on this question), but it seems to be about 4 times slower than the np.interp approach:

lookup_dict = dict(zip(order, range(len(order))))
indices = np.vectorize(lookup_dict.__getitem__)(a)
a_sorted = a[np.argsort(indices)]
Matthias Fripp
  • 17,670
  • 5
  • 28
  • 45