29

I've got an ndarray of floating-point values in NumPy and I want to find the unique values of this array. Of course, this has problems because of floating-point accuracy, so I want to be able to set a delta value to use for the comparisons when working out which elements are unique.

Is there a way to do this? At the moment I am simply doing:

unique(array)

Which gives me something like:

array([       -Inf,  0.62962963,  0.62962963,  0.62962963,  0.62962963,
    0.62962963])

where the values that look the same (to the number of decimal places being displayed) are obviously slightly different.

robintw

6 Answers

31

Another possibility is to just round to the nearest desirable tolerance:

np.unique(a.round(decimals=4))

where a is your original array.
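
For instance, a minimal usage sketch (the sample values and decimals=4 are only for illustration):

import numpy as np

# values that differ only beyond the 4th decimal place collapse to a single entry
a = np.array([0.62962963, 0.62962965, 0.62962971, 0.5])
print(np.unique(a.round(decimals=4)))  # roughly: [0.5     0.6296]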

Edit: Just to note that my solution and @unutbu's are nearly identical speed-wise (mine is maybe 5% faster) according to my timings, so either is a good solution.
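
If you want to reproduce that kind of comparison, here is a rough sketch using the standard timeit module (the array size and repeat count are arbitrary, and the absolute numbers will depend on your machine and NumPy version):

import timeit
import numpy as np

a = np.random.random(100000)

# time the rounding-based and flooring-based approaches over 100 repetitions each
t_round = timeit.timeit(lambda: np.unique(a.round(decimals=4)), number=100)
t_floor = timeit.timeit(lambda: np.unique(np.floor(1e4 * a) / 1e4), number=100)
print(t_round, t_floor)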

Edit #2: This is meant to address Paul's concern. It is definitely slower and there may be some optimizations one can make, but I'm posting it as-is to demonstrate the strategy:

import numpy as np

def eclose(a, b, rtol=1e-05, atol=1e-08):
    # element-wise closeness test, same criterion as np.isclose/np.allclose
    return np.abs(a - b) <= (atol + rtol * np.abs(b))

x = np.array([6.4, 6.500000001, 6.5, 6.51])
y = x.flat.copy()
y.sort()
ci = 0

U = np.empty((0,), dtype=y.dtype)

while ci < y.size:
    ii = eclose(y[ci], y)             # mask of values within tolerance of y[ci]
    mi = np.max(ii.nonzero())         # index of the largest member of that group
    U = np.concatenate((U, [y[mi]]))  # keep it as the representative
    ci = mi + 1

print(U)

This should be decently fast if there are many repeated values within the precision range, but if many of the values are unique, then this is going to be slow. Also, it may be better to set U up as a list and append to it inside the while loop, but that falls under 'further optimization'.
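
A sketch of that 'further optimization' (this is my reading of the suggestion, not a tested speed comparison; it uses the same eclose test as above):

import numpy as np

def eclose(a, b, rtol=1e-05, atol=1e-08):
    # same closeness test as above
    return np.abs(a - b) <= (atol + rtol * np.abs(b))

y = np.sort(np.array([6.4, 6.500000001, 6.5, 6.51]))
uniques = []                       # collect representatives in a Python list instead of concatenating arrays
ci = 0
while ci < y.size:
    ii = eclose(y[ci], y)          # mask of values within tolerance of y[ci]
    mi = np.max(ii.nonzero())      # index of the largest member of that group
    uniques.append(y[mi])
    ci = mi + 1
U = np.array(uniques)              # convert to an array only once, at the end
print(U)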

JoshAdel
  • +1: Rounding will correctly unify some corner cases where flooring may fail, and it is faster. – unutbu Mar 25 '11 at 04:15
  • This answer worked great for me. As an additional bonus, I was able to find unique *rows* in a 2D NumPy array by adding `axis=0` to the `np.unique` call. Thanks! – rayryeng Aug 10 '22 at 21:14
13

Don't floor and round both fail the OP's requirement in some cases?

np.floor([5.99999999, 6.0]) # array([ 5.,  6.])
np.round([6.50000001, 6.5], 0) #array([ 7.,  6.])

The way I would do it (and this may not be optimal, and is surely slower than other answers) is something like this:

import numpy as np
TOL = 1.0e-3
a = np.random.random((10,10))
i = np.argsort(a.flat)                    # indices that sort the flattened array
d = np.append(True, np.diff(a.flat[i]))   # gap between each sorted value and its predecessor
result = a.flat[i[d > TOL]]               # keep values whose gap exceeds TOL

Of course this method keeps only one member (the smallest, since the gaps are taken in ascending order) of any run of values that each fall within the tolerance of their neighbour, which means an array whose values are all closely chained together can collapse to a single value even though its max - min exceeds the tolerance.

Here is essentially the same algorithm, but easier to understand, and it should be faster as it avoids an indexing step:

a = np.random.random((10,))
b = a.copy()
b.sort()
d = np.append(True, np.diff(b))  # gap between each sorted value and its predecessor
result = b[d > TOL]
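
To illustrate the caveat above with made-up numbers: a chain of values that are each within TOL of their neighbour collapses to a single representative, even though the ends of the chain differ by more than TOL.

import numpy as np

TOL = 1.0e-3
b = np.sort(np.array([0.1000, 0.1009, 0.1018, 0.5]))  # 0.1000 and 0.1018 differ by more than TOL
d = np.append(True, np.diff(b))
print(b[d > TOL])  # roughly: [0.1  0.5] -- the chained run keeps only its smallest member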

The OP may also want to look into scipy.cluster (for a fancy version of this method) or numpy.digitize (for a fancy version of the other two methods).
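
For reference, a rough numpy.digitize sketch along those lines (the bin width, bin placement, and choice of one representative per bin are all my assumptions, not a recommendation):

import numpy as np

TOL = 1.0e-3
a = np.random.random(100)

# assign each value to a TOL-wide bin, then keep the first value seen in each occupied bin
bins = np.arange(a.min(), a.max() + TOL, TOL)
labels = np.digitize(a, bins)
_, first = np.unique(labels, return_index=True)
result = a[first]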

Paul
  • I like the idea in principle, but that last caveat seems like an even larger divergence from the OP's requirements. – JoshAdel Mar 25 '11 at 12:35
  • @JoshAdel: I have to assume the OP's data is naturally clustered (in the example, the values seem very tightly clustered around certain values) or else the request doesn't have much meaning. Assuming that, digitizing the OP's data at arbitrary thresholds (that could split the clusters) seems to do more harm than good. – Paul Mar 25 '11 at 14:30
  • What would be useful is the ability to do fixed point precision truncation. I have a way in mind to solve this correctly (albeit in a slower way), but I won't be able to post it until later. – JoshAdel Mar 25 '11 at 14:54
6

I have just noticed that the accepted answer doesn't work. Take this case, for example:

a = 1-np.random.random(20)*0.05
# 20 uniformly chosen values between 0.95 and 1.0
np.sort(a)
>>>> array([ 0.9514548 ,  0.95172218,  0.95454535,  0.95482343,  0.95599525,
             0.95997008,  0.96385762,  0.96679186,  0.96873524,  0.97016127,
             0.97377579,  0.98407259,  0.98490461,  0.98964753,  0.9896733 ,
             0.99199411,  0.99261766,  0.99317258,  0.99420183,  0.99730928])
TOL = 0.01

This results in:

a.flat[i[d>TOL]]
>>>> array([], dtype=float64)

This happens simply because no two adjacent values of the sorted input array are at least `TOL` apart, while the correct result should be:

>>>> array([ 0.9514548,  0.96385762,  0.97016127,  0.98407259,
             0.99199411])

(although it depends on how you decide which value to keep within each `TOL`-sized group)

You should use the fact that integers don't suffer from such machine-precision effects:

np.unique(np.floor(a/TOL).astype(int))*TOL
>>>> array([ 0.95,  0.96,  0.97,  0.98,  0.99])

which runs about 5 times faster than the proposed solution (according to %timeit).

Note that `.astype(int)` is optional, although removing it degrades performance by a factor of about 1.5, since extracting unique values from an array of ints is much faster than from an array of floats.

You might want to add half of `TOL` to the resulting unique values to compensate for the flooring effect:

(np.unique(np.floor(a/TOL).astype(int))+0.5)*TOL
>>>> array([ 0.955,  0.965,  0.975,  0.985,  0.995])
Guillaume S
3

In the current version of NumPy (1.23), numpy.unique has an optional parameter return_index that returns the indices of the first occurrence of each unique value. So you can simply use numpy.unique with return_index=True on a rounded array and then index the original array to obtain the original, non-rounded values, like this:

decimals = 3
# [1] selects the first-occurrence indices; indexing X (the original array) with them recovers the unrounded values
X_unique_with_tolerance = X[np.unique(X.round(decimals), return_index=True)[1]]
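
A self-contained usage sketch (the sample array is made up):

import numpy as np

X = np.array([0.62962963, 0.62962965, 0.5, 0.62962971])
decimals = 3
_, idx = np.unique(X.round(decimals), return_index=True)
print(X[idx])  # roughly: [0.5        0.62962963] -- original values, not the rounded ones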
Redgen
2

How about something like

np.unique1d(np.floor(1e7*x)/1e7)

where x is your original array.
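
For example, the same idea with np.unique (since np.unique1d is deprecated, as noted in the comments below; the sample values and the 1e7 scale are only for illustration):

import numpy as np

x = np.array([0.62962963, 0.629629631, 0.5])
# scaling, flooring and rescaling truncates everything beyond the 7th decimal place
print(np.unique(np.floor(1e7 * x) / 1e7))  # roughly: [0.5        0.6296296]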

unutbu
  • Just to note that `np.unique1d` is deprecated in version 1.4 and will be removed in 1.5. – JoshAdel Mar 25 '11 at 00:19
  • I actually may have those versions slightly wrong, but it's definitely not in the newest documentation. `np.unique` is recommended in its place. – JoshAdel Mar 25 '11 at 04:20
1

I just added support for this to npx (a small numpy extension package of mine).

import npx

a = [0.1, 0.15, 0.7]
a_unique = npx.unique(a, tol=2.0e-1)

assert all(a_unique == [0.1, 0.7])
Nico Schlömer
  • Interesting, thanks! Worth noting that this still runs into the same floating-point issues with bin edges, leading to surprising results. For example, `npx.unique([1.1+2.2, 3.3], tol=0.1)` gives me two results, but if I use a tighter tolerance: `npx.unique([1.1+2.2, 3.3], tol=0.01)`, I only get one. (And yes, this is an ill-defined problem in general, so there is no good general solution.) – Mark Dickinson Nov 18 '21 at 15:10
  • Maybe it would be less surprising if the multiples of `tol` formed the centers of the bins rather than their end points? There would still be surprises, but they might be less frequent or less obvious. – Mark Dickinson Nov 18 '21 at 15:13
  • Interesting edge case! I've just fixed that in npx, but yeah, there may be others. `round()`ing would just move the problem elsewhere. – Nico Schlömer Nov 18 '21 at 16:56