
I'm trying to select only the unique rows of a numpy.ndarray (a variable named cluster). When I define this variable explicitly, like here:

cluster=np.array([[0.157,-0.4778],[0.157,-0.4778],[0.157,-0.4778],[-0.06156924,-0.21786049],[-0.06156924,-0.21786049],[0.02,-0.35]])

it works as it should:

[[ 0.157      -0.4778    ]
 [-0.06156924 -0.21786049]
 [ 0.02       -0.35      ]]

Unfortunately, cluster is actually part of a bigger array (xtrans), so in practice it is defined through array slicing:

splitted_clusters=[0,1,4,5,10]

cluster=xtrans[splitted_clusters]

The functions are the same, the data types are the same.

BUT in the latter case it behaves unpredictably: sometimes it keeps identical rows and sometimes it doesn't. As a result I get something like this:

    [[ 0.157      -0.4778    ]
     [ 0.157      -0.4778    ]
     [-0.06156924 -0.21786049]
     [ 0.02       -0.35      ]]

In my real example with a 44×2 array it keeps 22 duplicate rows and misses 23 of them (the pattern is strange too: it keeps rows with indices 0, 1, 2, 4, 9, 11, 12, 18, etc.), and the number of kept duplicates varies between runs. It is supposed to keep only ONE (the first) of those 44 rows.

As a method of choosing unique rows, I first used the one from this thread: Find unique rows in numpy.array

# view each row as a single void element so np.unique can compare whole rows
b = np.ascontiguousarray(cluster).view(
        np.dtype((np.void, cluster.dtype.itemsize * cluster.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_cl = cluster[idx]

Then I tried my own code as a check:

unique_cl = np.array([0, 0])   # placeholder, overwritten on the first iteration
for i in range(cluster.shape[0]):
    if i == 0:
        unique_cl = np.vstack([cluster[i, :]])
    elif cluster[i, :].tolist() not in unique_cl.tolist():
        unique_cl = np.vstack([unique_cl, cluster[i, :]])

The results are the same, and I really have no idea why. I would be very grateful for any help/advice/suggestion/idea.

Update: the problem was in the floats. When I rounded the array's values to 7 decimal places, everything works as it should. Thanks to Eelco Hoogendoorn for this idea.
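The failure mode can be reproduced with a small sketch (the values below are invented for illustration, and `axis=0` requires NumPy 1.13+): two floats can differ only in their low bits, so NumPy's default array printing shows them as identical even though `np.unique` keeps both rows; rounding collapses them.

```python
import numpy as np

a = 0.157
b = 0.157 + 1e-12   # prints the same at default array precision, but a != b
cluster = np.array([[a, -0.4778], [b, -0.4778]])

print(a == b)                                    # False
print(len(np.unique(cluster, axis=0)))           # 2 -- both rows survive
print(len(np.unique(cluster.round(7), axis=0)))  # 1 -- duplicates collapse
```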

  • Is `b` the same? It looks like `b` is the same data, but each row is viewed as a 16-byte 'void' element. That allows `unique` to do its flattened sort and selection. – hpaulj Apr 02 '16 at 20:37
  • @hpaulj I suppose yes, as there is no other `b` in this code. It's of type `'numpy.ndarray'` as well but when I try to print it I see strange symbols and I don't know how encode/decode them: `[��|гY�? 9��v���? � h"lx�? @ ��|гY�? 9��v���? � h"lx�? ��|гY�? 9��v���? � h"lx�? �K7�A`�? 9��v���? F����x�? ��|гY�? 9��v���? � h"lx�? @ ��|гY�? 9��v���? � h"lx�? ��|гY�? 9��v���? � h"lx�? @ @ @]` – Nataly Apr 02 '16 at 20:44
  • What is the shape and dtype of the `b` generated from `xtrans[splitted_clusters]`? We can't debug your problem without a sample of `xtrans` or an idea of how it gets transformed to produce the new `b`. – hpaulj Apr 02 '16 at 20:49
  • 1
    Could this be a floating-point precision issue? ie, the floats look the same when printed, but are actually not bitwise-identical? Try using np.round and see if that makes a difference. – Eelco Hoogendoorn Apr 02 '16 at 21:07
  • 1
    Attempting to perform equality tests on general floating point values is tricky. Try `xtrans[i,:]==xtrans[j,:]` for any two rows that you think are identical. Or look `xtrans[i,:]-xtrans[j,:]`. The rows might not be as unique as you think. – hpaulj Apr 02 '16 at 22:39
  • @EelcoHoogendoorn thank you, you are right. Now it works perfectly!!! – Nataly Apr 03 '16 at 07:03

3 Answers


You can do it by converting the list to a set.

 aList = [[0.157, -0.4778], [0.157, -0.4778],
          [-0.06156924, -0.21786049], [0.02, -0.35]]
  1. Make a list of tuples from the list of lists; otherwise you will not be able to create a set or dictionary from it (lists are not hashable).
  2. The set constructor will do the rest for you:

    set([tuple(a) for a in aList])

Output:

set([(-0.06156924, -0.21786049), (0.02, -0.35), (0.157, -0.4778)])
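As the comment below notes, you can convert the result back to a two-dimensional list; a minimal sketch (note that a set preserves neither the original row order nor the indices):

```python
aList = [[0.157, -0.4778], [0.157, -0.4778],
         [-0.06156924, -0.21786049], [0.02, -0.35]]

unique_rows = set(tuple(a) for a in aList)    # tuples are hashable, lists are not
unique_list = [list(t) for t in unique_rows]  # back to a list of lists
```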
Rudziankoŭ
  • Then, of course, you can convert it back to two dimensional `list` – Rudziankoŭ Apr 02 '16 at 20:30
  • thank you for this idea, but I need to keep the original indices of the array. For example, in the first code block in my question the indices are in the variable `idx` – Nataly Apr 03 '16 at 06:53

The numpy_indexed package (disclaimer: I am its author) implements functionality of this kind, in a manner similar to the solution you posted. Hopefully its unit tests will prove useful and things work as expected... Could you give it a try on your dataset and see if it has the same problem?

import numpy_indexed as npi
npi.unique(cluster)
# try this as well, to see if fp representation has something to do with it
npi.unique(cluster.round(4))   
Eelco Hoogendoorn

A solution for finding unique rows in your numpy array (the `axis` argument requires NumPy 1.13 or later) would be

In [13]: uniq_vals, counts = np.unique(cluster, axis=0, return_counts=True)

In [14]: uniq_vals
Out[14]:
array([[-0.06156924, -0.21786049],
       [ 0.02      , -0.35      ],
       [ 0.157     , -0.4778    ]])

In [15]: counts
Out[15]: array([2, 1, 3], dtype=int64)

The option return_counts allows you to obtain the counts of unique rows.

This solution is explained in Find unique rows in numpy.array
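Since the question also needed the original row indices (the `idx` variable in the first snippet), `return_index` can be combined with `return_counts`; a sketch using the `cluster` array from the question:

```python
import numpy as np

cluster = np.array([[0.157, -0.4778], [0.157, -0.4778], [0.157, -0.4778],
                    [-0.06156924, -0.21786049], [-0.06156924, -0.21786049],
                    [0.02, -0.35]])

# return_index gives, for each unique row, the index of its first occurrence
uniq_vals, idx, counts = np.unique(cluster, axis=0,
                                   return_index=True, return_counts=True)
print(idx)      # [3 5 0]
print(counts)   # [2 1 3]
```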

Jon