
I want to store a numpy array to a file. This array contains thousands of float probabilities which all sum up to 1. But when I store the array to a CSV file and load it back, I realise that the numbers have been approximated, and their sum is now some 0.9999 value. How can I fix it?

(Numpy's random choice method requires probabilities to sum up to 1)
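
For context, a minimal sketch of the kind of round-trip that produces this symptom (the array size and the `'%.6f'` format string below are assumptions for illustration, not taken from the question):

import numpy as np

p = np.random.random(5000)
p /= p.sum()                        # probabilities now sum to 1 (up to rounding)

# Writing with only a few decimal digits discards precision...
np.savetxt('p.csv', p, fmt='%.6f')
q = np.loadtxt('p.csv')

print(p.sum())                      # e.g. 1.0
print(q.sum())                      # e.g. 0.99998... -- typically no longer accepted by np.random.choice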

  • What is the precision of those values? How do you save it to a CSV? How do you read the CSV back? Have you read [What Every Computer Scientist Should Know About Floating-Point Arithmetic](//docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)? – Pranav Hosangadi Jan 04 '23 at 18:25
  • For example, if I create a `arr = np.random.random((1000,))`, then normalize it `arr = arr / np.linalg.norm(arr)`, the sum of the squares should be `1`. However, `sum(arr**2)` is sometimes _not_ 1, because of floating point precision errors – Pranav Hosangadi Jan 04 '23 at 18:27
  • `csv` is a text format, so the save and load is limited by the float precision of the formatting. `np.save/load` writes binary data, a copy of the array's data buffer. I'd expect it to preserve the sum. – hpaulj Jan 04 '23 at 20:33
  • @Barmar: Please do not promiscuously close floating-point questions as duplicates of [that question](https://stackoverflow.com/questions/588004/is-floating-point-math-broken). Working with floating-point does not mean simply accepting that rounding errors occur, nor is somebody asking about a specific task in floating-point asking why floating-point has rounding errors. – Eric Postpischil Jan 04 '23 at 21:34
  • @Eric I didn't cast a close vote for duplicate. That was all Barmar :) My close vote was needs-debugging-details, because well, look at the question – Pranav Hosangadi Jan 04 '23 at 21:35
  • Re “But when I store the array to a CSV file and load it back, I realise that the numbers have been approximated, and their sum is now some 0.9999 value. How can I fix it?”: I do not use numpy, but, with Python, you should be able to print with enough significant digits to restore the original value when reading. Also, I would look into ways to format as with C’s “%a” format, which prints a hexadecimal representation with enough information to restore the original number. (It does not look like this is supported in the `"format"%number` form in Python 3.9.6, but maybe there is some other way?) – Eric Postpischil Jan 04 '23 at 21:40
  • @EricPostpischil FYI, `float.hex()` and `float.fromhex()` – Mark Tolonen Jan 04 '23 at 23:09
  • @Giulio Cusenza, "This array contains thousands of float probabilities which all sum up to 1." --> Doubtful. Post an example file and the code used to generate it. – chux - Reinstate Monica Jan 05 '23 at 00:58
  • @chux-ReinstateMonica Um, the antagonistic tone isn't really helpful. Anyway the notion of a lot of numbers which nominally sum to 1 is hardly implausible. – Robert Dodier Jan 05 '23 at 18:37
  • The best solution is to simply not store a csv, and use `numpy.save` and `numpy.load` to use a better serialization format (sketched after these comments) – juanpa.arrivillaga Jan 05 '23 at 18:59
  • @RobertDodier Without tangible data, answers tend to be theoretical. With true data (even just maybe a dozen of the thousands of float), one can at least see an example and offer a real solution and extend that to others. – chux - Reinstate Monica Jan 05 '23 at 21:41
  • @chux-ReinstateMonica that is not the matter of the question. Before saving the matrix, I do use it and it works fine because they sum up to 1. Then I saved it, reloaded it and I had a precision problem, which I solved by normalising the data. Providing an example of my code would just make things messier here. – Giulio Cusenza Jan 05 '23 at 23:45
  • @GiulioCusenza: There are conflicts or ambiguity here. You say when the array is reread, the sum is “some 0.9999 value” and “Numpy's random choice method requires probabilities to sum up to 1”. But [Robert Dodier](https://stackoverflow.com/a/75023082/298225) says `Numpy` does not actually require the probabilities sum to one. You should resolve the conflict. Providing a [mre] is one way to do that. If you do not do that, you should provide sufficient information to determine which statement is true. Otherwise, the problem is likely to be closed. – Eric Postpischil Jan 05 '23 at 23:58
  • There is no conflict. The value I am talking about starts with 0.9999 but has some other number following. This is a higher error than .choice() accepts. So, again, we are dealing with a precision error. The two answers I got are both helpful, so the problem is resolved. – Giulio Cusenza Jan 06 '23 at 13:51
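
As hpaulj and juanpa.arrivillaga suggest above, a binary round-trip with `np.save`/`np.load` sidesteps the text-formatting issue entirely. A minimal sketch (the file name is illustrative):

import numpy as np

p = np.random.random(5000)
p /= p.sum()

# np.save writes the raw float64 buffer, so np.load returns a bit-identical array.
np.save('p.npy', p)
q = np.load('p.npy')

print(np.array_equal(p, q))         # True -- identical values, hence an identical sum
print(q.sum() == p.sum())           # True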

2 Answers


Try using `np.savetxt`. Its default format, `'%.18e'`, keeps enough significant digits that each double-precision value survives the round trip exactly, so the reloaded array has the same sum:

import numpy as np

# Generate a probability vector and normalize it so it sums to 1.
arr = np.random.random(1000)
arr /= arr.sum()

# savetxt's default fmt='%.18e' preserves full double precision.
np.savetxt('arr.csv', arr, delimiter=',')

arr = np.loadtxt('arr.csv')
print(arr.sum())
# >>> 1.0
alpelito7

Due to floating-point arithmetic, you can get tiny rounding errors in what seem like ordinary calculations. However, the probabilities don't need to be perfect in order to use the choice function.

On reviewing the code in the current version of Numpy as obtained from Github, I see that the tolerance for the sum of probabilities is that sum(p) is within sqrt(eps) of 1, where eps is the double precision floating point epsilon, which is approximately 1e-16. So the tolerance is about 1e-8. (See lines 955 and 973 in numpy/random/mtrand.pyx.)

Farther down in mtrand.pyx, choice normalizes the probabilities (which are already almost normalized) to sum to 1; see line 1017.

My advice is to ensure that all 16 digits are stored in the CSV; then, when you read them back, the error in the sum will be much smaller than 1e-8 and choice will be happy. I think other people commenting here have posted some advice about how to print all digits.
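
As an illustration of the tolerance described in this answer, here is a minimal sketch (mine, not the answerer's; the `'%.17g'` format is one way to keep enough digits, and `np.savetxt`'s default `'%.18e'` also works):

import numpy as np

eps = np.finfo(np.float64).eps          # ~2.2e-16
p = np.random.random(5000)
p /= p.sum()

# Write with enough significant digits to round-trip a double exactly.
np.savetxt('p.csv', p, fmt='%.17g')
q = np.loadtxt('p.csv')

# choice accepts p if sum(p) is within sqrt(eps) (~1.5e-8) of 1, then renormalizes.
print(abs(q.sum() - 1.0) < np.sqrt(eps))    # True
print(np.random.choice(len(q), size=10, p=q))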

Robert Dodier
  • Binary floating-point does not have “all 16 digits”; it does not have decimal digits at all. For the commonly used IEEE-754 binary64 format, the smallest number of decimal digits guaranteed to have enough information to reproduce the original binary64 number is 17, not 16. – Eric Postpischil Jan 05 '23 at 19:31
  • Robert Dodier, To add: that is 17 significant digits, not 17 digits after the decimal place. – chux - Reinstate Monica Jan 05 '23 at 21:46
  • @EricPostpischil There's no need to talk to me like I have no idea what's going on. Thanks for understanding. The threshold for precision is actually much larger than the floating point epsilon (in fact, it's the square root of it), so this business about 16 vs. 17 digits is just some pointless quibbling. – Robert Dodier Jan 05 '23 at 22:22