Sorting numpy arrays using lexsort

Question

I am mainly interested in 2D arrays of shape Nx3 but the issue appears in arrays of shapes Nxm where m>1 as well. Specifically, I would like to sort an Nx3 array first based on its first column, then second, and finally third. So, assuming that we have array k given as

array([[0.90625, 0.90625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.90625, 0.65625],
       [0.15625, 0.90625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.65625, 0.65625],
       [0.15625, 0.65625, 0.40625],
       [0.65625, 0.15625, 0.90625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.90625, 0.40625],
       [0.65625, 0.40625, 0.40625],
       [0.15625, 0.15625, 0.90625],
       [0.40625, 0.40625, 0.40625],
       [0.65625, 0.90625, 0.40625],
       [0.90625, 0.15625, 0.40625]])

the desired (sorted) array should be

array([[0.15625, 0.15625, 0.90625],
       [0.15625, 0.65625, 0.40625],
       [0.15625, 0.90625, 0.40625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.40625, 0.40625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.65625, 0.65625],
       [0.40625, 0.90625, 0.40625],
       [0.40625, 0.90625, 0.65625],
       [0.65625, 0.15625, 0.90625],
       [0.65625, 0.40625, 0.40625],
       [0.65625, 0.90625, 0.40625],
       [0.90625, 0.15625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.90625, 0.90625, 0.15625]])

I thought I could achieve that by using np.lexsort but it seems I am probably missing something and is not working as expected. So far, I've been doing the following

In [28]: k[np.lexsort((k[:,2], k[:,1], k[:,0]))]
Out[28]: 
array([[0.15625, 0.65625, 0.40625],
       [0.15625, 0.15625, 0.90625],
       [0.15625, 0.90625, 0.40625],
       [0.40625, 0.65625, 0.65625],
       [0.40625, 0.90625, 0.40625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.40625],
       [0.40625, 0.90625, 0.65625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.65625, 0.15625, 0.90625],
       [0.65625, 0.90625, 0.40625],
       [0.65625, 0.40625, 0.40625],
       [0.90625, 0.40625, 0.90625],
       [0.90625, 0.15625, 0.40625],
       [0.90625, 0.90625, 0.15625]])

It seems that the first column is sorted properly but the others are not. A similar question was asked before but I believe the accepted answer (which is essentially what I am doing) does not work.

From what I understood after looking a little bit more into it, I think it has to do with the values of the array being floats.

E D I T

I found the answer to my problem. However, I'll add it as an "edit" rather than posting it as an answer because I believe this whole situation could possibly be avoided if I had mentioned a fine detail about matrix k in my original post. Matrix k is created from another matrix a, where a is essentially created by reading a matrix of floats with 16 decimals from a file. Now let's look at the workflow that led me to the solution.

In [6]: k=a[[1,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60]]

In [7]: k
Out[7]: 
array([[0.15625, 0.15625, 0.40625],
       [0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [8]: np.random.shuffle(k)

In [9]: k
Out[9]: 
array([[0.15625, 0.90625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625],
       [0.15625, 0.40625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.15625, 0.15625, 0.40625],
       [0.90625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625]])

In [10]: k[np.lexsort((k[:,2],k[:,1],k[:,0]))]
Out[10]: 
array([[0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.15625, 0.15625, 0.40625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [11]: k=np.round(k, 5)

In [12]: k[np.lexsort((k[:,2],k[:,1],k[:,0]))]
Out[12]: 
array([[0.15625, 0.15625, 0.40625],
       [0.15625, 0.40625, 0.15625],
       [0.15625, 0.65625, 0.15625],
       [0.15625, 0.90625, 0.15625],
       [0.40625, 0.15625, 0.15625],
       [0.40625, 0.40625, 0.15625],
       [0.40625, 0.65625, 0.15625],
       [0.40625, 0.90625, 0.15625],
       [0.65625, 0.15625, 0.15625],
       [0.65625, 0.40625, 0.15625],
       [0.65625, 0.65625, 0.15625],
       [0.65625, 0.90625, 0.15625],
       [0.90625, 0.15625, 0.15625],
       [0.90625, 0.40625, 0.15625],
       [0.90625, 0.65625, 0.15625],
       [0.90625, 0.90625, 0.15625]])

In [13]: np.savetxt(sys.stdout, a[[1,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60]], fmt='%.18f')
0.156250000000000000 0.156250000000000000 0.406250000000000000
0.156249999999999972 0.406250000000000000 0.156250000000000028
0.156249999999999972 0.656250000000000000 0.156250000000000028
0.156249999999999972 0.906250000000000000 0.156250000000000028
0.406250000000000000 0.156249999999999972 0.156250000000000028
0.406250000000000000 0.406250000000000000 0.156250000000000028
0.406249999999999944 0.656250000000000000 0.156250000000000028
0.406249999999999944 0.906250000000000000 0.156250000000000028
0.656250000000000000 0.156249999999999972 0.156250000000000028
0.656250000000000000 0.406249999999999944 0.156250000000000028
0.656250000000000000 0.656250000000000000 0.156250000000000028
0.656250000000000000 0.906250000000000000 0.156250000000000056
0.906250000000000000 0.156249999999999972 0.156250000000000028
0.906250000000000000 0.406249999999999944 0.156250000000000028
0.906250000000000000 0.656250000000000000 0.156250000000000056
0.906250000000000000 0.906250000000000000 0.156250000000000056

As can be seen by the above, it was all a matter of rounding errors. Apparently, everything was seemingly fine when printed with a few decimals, but when the file was read and matrix a was created, it was stored with inaccuracies after the 16th decimal place. Consequently, these inaccuracies were carried down to k when it was defined from a. Therefore, lexsort was giving the correct result from the beginning considering the real number that was stored in the matrix. Everything worked fine when I rounded matrix k.

Moral of the story: Always check the accuraccies of your values.

score 0 · Answer 1 · answered Feb 22 '20 at 21:32

0

I think numpy isn't flexible for this kind of operations, though I can't deny some kind of solution exists. I recommend you to use other packages such as pandas or numpy_indexed (assuming data is your array):

pandas

import pandas as pd
df = pd.DataFrame(data)
sorted_data = np.array(df.sort_values(by=[0,1,2]))

numpy_indexed

import numpy_indexed as npi
npi.sort(data)

Sources

For more general cases of usages you might like to check this answer

answered Feb 22 '20 at 21:32

mathfux

5,759
1
14
34

Thanks for your reply. The `pandas` solution is the same to what I am already doing. Pandas also gives the wrong output. I haven't tried `numpy_indexed` but I would like to keep it pure numpy. – alxkrt Feb 23 '20 at 18:47
I have checked an output of pandas and it was the same as you desire. What's the matter? – mathfux Feb 23 '20 at 19:31
As far as I know, the only way to perform sorting of multidimensional array on `numpy` is to access indices of another array which is the result of dimensionality reduction. Note that conversion to integers is needed first. It seems a lot of hardwork for me but you can continue reading here: https://stackoverflow.com/questions/38674027/find-the-row-indexes-of-several-values-in-a-numpy-array/38674038#38674038 – mathfux Feb 23 '20 at 19:39
thank you so much for offering an answer. Please look at my edited post to see what the problem was. Essentially, rounding errors caused the unexpected results – alxkrt Feb 23 '20 at 22:14
1

Wow, `lexsort` is really new for me. I compared it with `pandas` and it was, unexpectedly for me, more than 3 times faster. – mathfux Feb 24 '20 at 13:16
I am not 100% positive about that, but I think that `sort_values` from pandas essentially uses `lexsort` from numpy under the hood. – alxkrt Feb 24 '20 at 16:04

Sorting numpy arrays using lexsort

1 Answers1