Sorting a numpy array based on data from another array

Question

I have two sets of array data and result. result contains the same elements in data but with an extra column and in unsorted order. I want to rearrange the result array so that it is in the same order as the rows in data , while bringing the associated value into the last column with the rest of the row when doing the sorting.

data = np.array([[0,1,0,0],[1,0,0,0],[0,1,1,0],[0,1,0,1]])
result = np.array([[0,1,1,0,1],[1,0,0,0,0],[0,1,0,0,1],[0,1,0,1,0]])

# this is what the final sorted array should look like:
'''
array([[0, 1, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 1, 0, 1, 0]])
 '''

I've tried doing argsort in order to reverse data into the sorted order then applying that to result but argsort seems to sort the order of the array based on each element, whereas I want the sort to treat each row of the data[:,4] as a whole.

ind = np.argsort(data)
indind =np.argsort(ind)
ind
array([[0, 2, 3, 1],
   [1, 2, 3, 0],
   [0, 3, 1, 2],
   [0, 2, 1, 3]])

What is a good way to do this kind of sorting by rows?

is the extra column always placed last in the sequence? – Deusdeorum Apr 10 '16 at 21:33 — Deusdeorum, Apr 10 '16 at 21:33

Eelco Hoogendoorn · Answer 1 · 2016-04-11T06:43:13.237

The numpy_indexed package (disclaimer: I am its author) can be used to efficiently and elegantly solve these kind of problems:

import numpy_indexed as npi
result[npi.indices(result[:, :-1], data)]

npi.indices is essentially a vectorized equivalent of list.index; so for each element (row) in data, we get where that same row is located in result, minus the last column.

Note that this solution works for any number of columns, and is fully vectorized (ie, no python loops anywhere).

hpaulj · Answer 2 · 2016-04-10T23:09:52.843

Just to try to clarify what you are doing. With an index list [2,1,0,3] I can reorder the rows of result thus:

In [37]: result[[2,1,0,3],:]
Out[37]: 
array([[0, 1, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 1, 0, 1, 0]])

In [38]: result[[2,1,0,3],:4]==data
Out[38]: 
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

I don't see how argsort or sort is going to help come up with this indexing order.

With np.lexsort I can order the rows of both arrays the same:

In [54]: data[np.lexsort(data.T),:]
Out[54]: 
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 0, 1]])

In [55]: result[np.lexsort(result[:,:-1].T),:]
Out[55]: 
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [0, 1, 0, 1, 0]])

I found by trial and error that I needed to use the transpose. We need to check the docs of lexsort to understand why.

A little more trial and error produces:

In [66]: i=np.lexsort(data.T)
In [67]: j=np.lexsort(result[:,:-1].T)
In [68]: j[i]
Out[68]: array([2, 1, 0, 3], dtype=int64)

In [69]: result[j[i],:]
Out[69]: 
array([[0, 1, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 1, 0, 1, 0]])

This is a tentative solution. It needs to be tested on other samples. And needs to be explained.

Divakar · Answer 3 · 2016-04-10T22:38:24.437

Approach #1

Here's an approach considering each row as an indexing tuple and then finding the matching indices between data and result corresponding to those linear index equivalents. These indices would represent the new order of rows, which when indexed into result would give us the desired output. The implementation would look like this -

# Slice out from result everything except the last column       
r = result[:,:-1]       

# Get linear indices equivalent of each row from r and data
ID1 = np.ravel_multi_index(r.T,r.max(0)+1)
ID2 = np.ravel_multi_index(data.T,r.max(0)+1)

# Search for ID2 in ID1 and use those indices index into result
out = result[np.where(ID1[:,None] == ID2)[1]]

Approach #2

If all the rows from data are guaranteed to be in result, you can use another approach based on just argsort, like so -

# Slice out from result everything except the last column       
r = result[:,:-1]       

# Get linear indices equivalent of each row from r and data
ID1 = np.ravel_multi_index(r.T,r.max(0)+1)
ID2 = np.ravel_multi_index(data.T,r.max(0)+1)   

sortidx_ID1 = ID1.argsort()
sortidx_ID2 = ID2.argsort()
out = result[sortidx_ID1[sortidx_ID2]]

Sample run for a bit more generic case -

In [37]: data
Out[37]: 
array([[ 3,  2,  1,  5],
       [ 4,  9,  2,  4],
       [ 7,  3,  9, 11],
       [ 5,  9,  4,  4]])

In [38]: result
Out[38]: 
array([[ 7,  3,  9, 11, 55],
       [ 4,  9,  2,  4,  8],
       [ 3,  2,  1,  5,  7],
       [ 5,  9,  4,  4, 88]])

In [39]: r = result[:,:-1]
    ...: ID1 = np.ravel_multi_index(r.T,r.max(0)+1)
    ...: ID2 = np.ravel_multi_index(data.T,r.max(0)+1)
    ...: 

In [40]: result[np.where(ID1[:,None] == ID2)[1]] # Approach 1
Out[40]: 
array([[ 3,  2,  1,  5,  7],
       [ 4,  9,  2,  4,  8],
       [ 7,  3,  9, 11, 55],
       [ 5,  9,  4,  4, 88]])

In [41]: sortidx_ID1 = ID1.argsort()  # Approach 2
    ...: sortidx_ID2 = ID2.argsort()
    ...: 

In [42]: result[sortidx_ID1[sortidx_ID2]]
Out[42]: 
array([[ 3,  2,  1,  5,  7],
       [ 4,  9,  2,  4,  8],
       [ 7,  3,  9, 11, 55],
       [ 5,  9,  4,  4, 88]])

This answer works for a small dataset like the example I gave above, but when I use a larger example (5172x32 dataset), it gives me the error "ValueError: too many dimensions passed to ravel_multi_index". How should I resolve this? — ROBOTPWNS, Apr 11 '16 at 00:55
@ROBOTPWNS Calculate those ID1 and ID2, like so and see if it works : `ID1 = r.dot(r.max(0)+1); ID2 = data.dot(r.max(0)+1)`? — Divakar, Apr 11 '16 at 10:14
No that didn't work, I ended up just rebuilding the array before the sequence was mixed up and then unsorted based on those indicies. But thanks though. — ROBOTPWNS, Apr 14 '16 at 22:52

Sorting a numpy array based on data from another array

3 Answers3

Linked