1

I have two 1D-arrays containing the same set of values, but in a different (random) order. I want to find the list of indices, which reorders one array according to the other one. For example, my 2 arrays are:

ref = numpy.array([5,3,1,2,3,4])
new = numpy.array([3,2,4,5,3,1])

and I want the list order for which new[order] == ref.

My current idea is:

def find(val):
    return numpy.argmin(numpy.absolute(ref-val))

order = sorted(range(new.size), key=lambda x:find(new[x]))

However, this only works as long as no values are repeated. In my example 3 appears twice, and I get new[order] = [5 3 3 1 2 4]. The second 3 is placed directly after the first one, because my function val() does not track which 3 I am currently looking for.

So I could add something to deal with this, but I have a feeling there might be a better solution out there. Maybe in some library (NumPy or SciPy)?

Edit about the duplicate: This linked solution assumes that the arrays are ordered, or for the "unordered" solution, returns duplicate indices. I need each index to appear only once in order. Which one comes first however, is not important (neither possible based on the data provided).

What I get with sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)] is: [3, 0, 5, 1, 0, 2]. But what I am looking for is [3, 0, 5, 1, 4, 2].

Feodoran
  • 1,752
  • 1
  • 14
  • 31
  • Does it really matter which index you are getting if the element is repeated? Are you trying to do something more than `a[ind]` to get `b`? – Mad Physicist Oct 16 '17 at 14:13
  • That being said , yes there is a way using multiple argsorts. I'll write it up as soon as I get to an actual computer. – Mad Physicist Oct 16 '17 at 14:14
  • @Divakar. This is not the same question. It's asking to find the indices of a shuffle, not a subset. As such, there is a nice optimization possible using argsort which does not apply to the other question. I hope you support my bid to reopen. – Mad Physicist Oct 16 '17 at 14:27
  • @MadPhysicist Not sure what subset you are referring to. The searchsorted solution there gives the indices, which is the expected `order` in this question. Did you try that solution? – Divakar Oct 16 '17 at 14:30
  • @Divakar I tried, and I got some multiple indices in `order` (see edit). – Feodoran Oct 16 '17 at 14:31
  • @Feodoran Your `new` would be A. Your `ref` would be B. Now, using the linked solution : `sort_idx = A.argsort(); order = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]`. Does that answer your question? If it doesn't, please do explain how/why it doesn't. Would be happy to re-open otherwise. – Divakar Oct 16 '17 at 14:33
  • @Divakar. The solutions to the other question work here just fine, no argument there. This question has an additional rule though, that the arrays are guaranteed to be the same size. This leads to a solution that does not apply to the other question. I would like to record this solution and I think that it's existence indicates that this is not exactly the same question. – Mad Physicist Oct 16 '17 at 14:33
  • @MadPhysicist Hmm good point. Would love to see an optimized version for such a case (if there's any). Re-opened. – Divakar Oct 16 '17 at 14:39
  • @Divakar as already mentioned, yes I tried. I added what I actually get and I need to the question. – Feodoran Oct 16 '17 at 14:43
  • @Divakar. Not sure about optimized , but the indices will be unique, which seems to be a concern. – Mad Physicist Oct 16 '17 at 14:43

1 Answers1

2

Given ref, new which are shuffled versions of each other, we can get the unique indices that map ref to new using the sorted version of both arrays and the invertibility of np.argsort.

Start with:

i = np.argsort(ref)
j = np.argsort(new)

Now ref[i] and new[j] both give the sorted version of the arrays, which is the same for both. You can invert the first sort by doing:

k = np.argsort(i)

Now ref is just new[j][k], or new[j[k]]. Since all the operations are shuffles using unique indices, the final index j[k] is unique as well. j[k] can be computed in one step with

order = np.argsort(new)[np.argsort(np.argsort(ref))]

From your original example:

>>> ref = np.array([5, 3, 1, 2, 3, 4])
>>> new = np.array([3, 2, 4, 5, 3, 1])
>>> np.argsort(new)[np.argsort(np.argsort(ref))]
>>> order
array([3, 0, 5, 1, 4, 2])
>>> new[order]  # Should give ref
array([5, 3, 1, 2, 3, 4])

This is probably not any faster than the more general solutions to the similar question on SO, but it does guarantee unique indices as you requested. A further optimization would be to to replace np.argsort(i) with something like the argsort_unique function in this answer. I would go one step further and just compute the inverse of the sort:

def inverse_argsort(a):
    fwd = np.argsort(a)
    inv = np.empty_like(fwd)
    inv[fwd] = np.arange(fwd.size)
    return inv

order = np.argsort(new)[inverse_argsort(ref)]
Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • Idea seems correct. To optimize it further, we could use [`argsort_unique`](https://stackoverflow.com/a/43411559/) to get `k`. – Divakar Oct 16 '17 at 15:13
  • @Divakar. I added it to the answer. – Mad Physicist Oct 16 '17 at 15:20
  • @Feodoran. I see the purpose of your edit. I clobbered it, but will update myself. – Mad Physicist Oct 16 '17 at 15:21
  • Nice, thank you. I was a litte worried at first about calling three times `argsort`. But the `inverse_argsort` optimizes this quite well, more than twice as fast. – Feodoran Oct 16 '17 at 15:42
  • Can this be extended to a multi-dimensional version? In the sense that the elements of `ref` and `new` are not scalars but arrays as well. So 1 axis needs to be reordered, the others are kept fixed. For example in 2D, such that: `new[order,:] == ref`. – Feodoran Oct 19 '17 at 15:45
  • @Feodoran. I would imagine so. `np.argsort` has an `axis` parameter to tell you which dimension to sort on. The default is the last one (`-1`). – Mad Physicist Oct 19 '17 at 16:05
  • The problem is: `numpy.sort(a, axis=0)` sorts each column independently, thus mixing the rows. Accordingly `argsort(a, axis=0)` returns a 2D array, but I need a 1D here. (One could ask now how exactly the different columns should be compared and ordered. But the how does not really matter here, as long as it is consistent.) – Feodoran Oct 19 '17 at 16:28
  • @Feodoran. I see the problem. According to [this Q/A](https://stackoverflow.com/q/18920010/2988730), you can use [`np.lexsort`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.lexsort.html). – Mad Physicist Oct 19 '17 at 16:34
  • 1
    @Feodoran. The inversion function would work just fine if you replaced `argsort` with the appropriate call to `lexsort`. – Mad Physicist Oct 19 '17 at 16:35
  • Yes, thank you! Replacing `argsort(a)` with `lexsort(a.T, axis=0)` (both calls) is what I needed. – Feodoran Oct 19 '17 at 17:16
  • @Feodoran. `lexsort(a.T, axis=0)` should be the same thing as `lexsort(a, axis=1).T`. Not sure if that helps you any, but it might. – Mad Physicist Oct 19 '17 at 23:13
  • No, `lexsort(a, axis=1)` throws some error. And the final `.T` does not really matter since the output is 1D. – Feodoran Oct 20 '17 at 07:28
  • @Feodoran. I just realized what `np.lexsort(a.T)` does. You are absolutely correct. – Mad Physicist Oct 20 '17 at 14:11