
I am using the Latent Dirichlet Allocation class in sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html The output is a 2D numpy array of floats in which each row is a probability distribution. However, some rows do not add up to exactly 1.0, for example:

row index   sum
5           0.9999999999999999
6           0.9999999999999999
7           1.0000000000000002
9           0.9999999999999999
10          0.9999999999999999
12          0.9999999999999999
13          1.0000000000000002
 ...
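
For context, here is a minimal sketch of how such sums arise (the input matrix, its shape, and the LDA parameters are all hypothetical):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# hypothetical document-term count matrix
rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(20, 100))

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topic = lda.fit_transform(X)  # each row is a topic distribution

# row sums differ from 1.0 by an ulp or two
print(doc_topic.sum(axis=1))
```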

I am having problems in the next steps of my project due to this issue. Specifically, the 2D array is saved as a pandas DataFrame and stored as a .csv file. Another R script loads the matrix from the .csv file and computes the total variation distance between pairs of rows by applying the package function distrEx::TotalVarDist(), which sums each row and raises an error if the sum is not 1.0. This requires sum(row) == 1.0 for every row.
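
For reference, the save step looks roughly like this (a minimal sketch; the file name is hypothetical and `doc_topic` is the array from the sketch above):

```python
import pandas as pd

# store the distributions for the downstream R script to load
pd.DataFrame(doc_topic).to_csv("doc_topic_distributions.csv", index=False)
```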

How can I ensure that all rows add up to exactly 1.0?

Given this matrix, I could patch it by adding/subtracting the tiny error to/from the first number in each row, but this is obviously bad practice. A sketch of that workaround follows.
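
Here is a minimal sketch of that workaround, using the hypothetical `doc_topic` array from above (note it can only make the sum exact for one particular summation order):

```python
# push each row's residual onto its first entry; the rows then sum
# to 1.0 under numpy's summation, but other summation orders (for
# example, the one used by the R consumer) may still disagree
residual = doc_topic.sum(axis=1) - 1.0
doc_topic[:, 0] -= residual
print(doc_topic.sum(axis=1))
```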

How can I fix this properly?

Kid_Learning_C
  • Can you expand on `I am having problems in the next steps of my project due to this issue.`? These errors are inevitable with `float` computations. See also [Is floating point math broken?](https://stackoverflow.com/questions/588004/is-floating-point-math-broken). – jpp Sep 26 '18 at 21:59
  • Looks like you're just running into floating-point math issues. Use `np.isclose` if you need to make sure they are close to 1 (see the sketch after these comments). – user3483203 Sep 26 '18 at 22:00
  • Basically, you can't: since floating-point summation is order-dependent, there is no way to guarantee that a set of values summed one way will give the same result when summed another. – Simon Byrne Sep 26 '18 at 22:01
  • @jpp Specifically, the 2D array is saved as a pandas DataFrame and stored as a .csv file. Another R/Shiny program loads the matrix from the .csv file and computes the total variation distance between pairs of rows. This needs sum(row) == 1.0 for each row; otherwise it reports an error saying the numbers don't sum to 1.0. – Kid_Learning_C Sep 26 '18 at 22:25
  • [One related question.](https://stackoverflow.com/questions/17641300/rounding-floats-so-that-they-sum-to-precisely-1/17643222#17643222) – Eric Postpischil Sep 26 '18 at 23:11
  • To elaborate on @SimonByrne's comment: not only do floating-point inaccuracies cause your result not to sum to 1, but the very concept of having an exact sum is implementation-defined. Say you have a function `sum_forwards` that iterates over your array and computes the sum from the first element onwards. You can tweak your data until `sum_forwards(data) == 1` (exactly). But now you pass your data to my library, which starts with `assert sum_backwards(data) == 1`, and that fails! To solve your problem, you need to know exactly the algorithm used by `distrEx::TotalVarDist` to compute the sum. – Eric Sep 27 '18 at 01:56
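
To illustrate the two points raised in the comments, here is a minimal sketch of the order-dependence of floating-point summation and of the tolerance-based check (`doc_topic` is the hypothetical array from the question):

```python
import numpy as np

# summation order matters: ten 0.1s do not sum to exactly 1.0
vals = [0.1] * 10
print(sum(vals))     # 0.9999999999999999 (left-to-right accumulation)
print(np.sum(vals))  # pairwise summation; may round differently

# tolerance-based check instead of demanding == 1.0
row_sums = doc_topic.sum(axis=1)
print(np.isclose(row_sums, 1.0))   # element-wise booleans
print(np.allclose(row_sums, 1.0))  # single bool for the whole array
```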

0 Answers