
I have an array of random floats and I need to compare it to another one that has the same values in a different order. To do so, I use the sum, the product, and other such combinations (which ones depend on the dimension of the table, and hence on the number of equations needed).

However, I ran into a precision issue: the result of the sum (or product) over the array depends on the order of the values.

Here is a simple standalone example to illustrate this issue:

import numpy as np

n = 10
m = 4

tag = np.random.rand(n, m)

s1 = np.sum(tag, axis=1)
s2 = np.sum(tag[:, ::-1], axis=1)

# print the number of times s1 is not equal to s2 (should be 0)
print(np.nonzero(s1 != s2)[0].shape[0])

If you execute this code, it sometimes tells you that s1 and s2 are not equal, with a difference on the order of the machine precision.
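As a minimal scalar illustration of why the order matters: floating-point addition is not associative, so adding the same numbers in a different order can change the last bits of the result.

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False: the two groupings differ in the last bit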

The problem is that I need to use these sums in functions like np.in1d, where I can't really give a tolerance...

Is there a way to avoid this issue?

Divakar
Thomas Leonard
  • You can never expect floating point operations to be exact. You should change your algorithm to accommodate some errors. Otherwise there are some fancier sums in the `statistics` module, but that's not really the point now. Especially with numpy, where vectorization should be a basic tool, you can never rely on the order of arithmetic operations. – Andras Deak -- Слава Україні Sep 28 '16 at 21:01
  • How/where are you using `np.in1d`? For the listed code, you can use `np.isclose(s1,s2)`. – Divakar Sep 28 '16 at 21:03
  • @Divakar I don't use it in the example, but in my actual algorithm I would use ``np.in1d(s1, s2)`` and other equivalent arrays obtained through other operations like the product, etc... – Thomas Leonard Sep 28 '16 at 21:30
  • @Andras Deak I think I need to rethink the choice of my tag... Random floats are probably not a good idea, as you point out, but they were convenient since I am doing this on a very large array and I perform operations like ``tag**3``, which I was afraid would lead to overflows if using integers... I'm open to suggestions (my tag array must have non-repeated values)... – Thomas Leonard Sep 28 '16 at 21:37
  • No, no, the solution is usually not to restrict your data from floats:) My point was the same as that of Divakar: don't use exact tests, use closeness tests. But you seem to be already aware of this, which is why I only suggested to restructure your algorithm to avoid exact tests. – Andras Deak -- Слава Україні Sep 28 '16 at 21:39
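Regarding the "fancier sums" mentioned in the first comment: the standard library's math.fsum computes a correctly rounded floating-point sum, so reordering the values no longer changes the result. A minimal sketch (note the Python-level loop over rows) -

import math
import numpy as np

tag = np.random.rand(10, 4)

# math.fsum tracks the partial sums exactly, so the result does not depend
# on the order in which the elements of a row are added
s1 = np.array([math.fsum(row) for row in tag])
s2 = np.array([math.fsum(row[::-1]) for row in tag])

print(np.nonzero(s1 != s2)[0].shape[0])  # 0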

1 Answer


For the listed code, you can use np.isclose, which also lets you specify tolerance values.

Using the provided sample, let's see how it could be used -

In [201]: n = 10
     ...: m = 4
     ...: 
     ...: tag = np.random.rand(n, m)
     ...: 
     ...: s1 = np.sum(tag, axis=1)
     ...: s2 = np.sum(tag[:, ::-1], axis=1)
     ...: 

In [202]: np.nonzero(s1 != s2)[0].shape[0]
Out[202]: 4

In [203]: (~np.isclose(s1,s2)).sum() # So, all matches!
Out[203]: 0
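If the default tolerances (rtol=1e-05, atol=1e-08) are not appropriate, they can also be passed explicitly; the values below are only illustrative -

# Tolerances chosen for illustration; pick them based on the expected error magnitude
close = np.isclose(s1, s2, rtol=0, atol=1e-12)
print(close.all())  # True when s1 and s2 only differ by rounding error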

To make use of tolerance values in other scenarios, we need to work on a case-by-case basis. So, for an implementation that involves elementwise comparisons like np.in1d, we can bring in broadcasting to perform those elementwise equality checks for all elements of the first input against all elements of the second one. Then, we use np.abs to get the "closeness factor" and finally compare it against the input tolerance to decide the matches. To simulate np.in1d, we do an ANY reduction along one of the axes. Thus, np.in1d with tolerance using broadcasting could be implemented like so -

def in1d_with_tolerance(A,B,tol=1e-05):
    # |A[i] - B[j]| for every pair via broadcasting, then ANY over B's axis
    return (np.abs(A[:,None] - B) < tol).any(1)

As suggested in the comments by the OP, we can also round the floating-point numbers after scaling them up; this should be more memory efficient, which is needed when working with large arrays. So, a modified version would be like so -

def in1d_with_tolerance_v2(A,B,tol=1e-05):
    # Scale by 1/tol and round to integers, so values within tol of each other
    # (usually) map to the same integer and plain np.in1d can be used
    S = round(1/tol)
    return np.in1d(np.around(A*S).astype(int),np.around(B*S).astype(int))

Sample run -

In [372]: A = np.random.rand(5)
     ...: B = np.random.rand(7)
     ...: B[3] = A[1] + 0.0000008
     ...: B[6] = A[4] - 0.0000007
     ...: 

In [373]: np.in1d(A,B) # Not the result we want!
Out[373]: array([False, False, False, False, False], dtype=bool)

In [374]: in1d_with_tolerance(A,B)
Out[374]: array([False,  True, False, False,  True], dtype=bool)

In [375]: in1d_with_tolerance_v2(A,B)
Out[375]: array([False,  True, False, False,  True], dtype=bool)

Finally, on how to make it work for other implementations and use cases - it would depend on the implementation itself, but for most cases np.isclose and broadcasting should help.
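For instance, a variant of the earlier function that reuses np.isclose (and thus gets a relative tolerance as well) could be sketched like so; it has the same memory footprint as in1d_with_tolerance -

def in1d_isclose(A, B, rtol=1e-05, atol=1e-08):
    # Compare every element of A against every element of B with np.isclose,
    # then reduce with any() along B's axis to mimic np.in1d
    return np.isclose(A[:, None], B, rtol=rtol, atol=atol).any(1)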

Divakar
  • This implementation of in1d with tolerance uses a lot more memory (for big arrays)... so I did instead a ``np.around`` of the arrays I want to compare with a number of decimals close to the floating point precision and it seems to do the job. Thank you. – Thomas Leonard Sep 28 '16 at 22:42
  • @thomleo Good point! Added that into the post, hope that's okay. – Divakar Sep 28 '16 at 23:09