I am using the Latent Dirichlet Allocation implementation in sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

The output is a 2D numpy array of floats in which each row is a probability distribution. However, some rows do not add up to exactly 1.0, for example:
row index, sum
5 0.9999999999999999
6 0.9999999999999999
7 1.0000000000000002
9 0.9999999999999999
10 0.9999999999999999
12 0.9999999999999999
13 1.0000000000000002
...
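For reference, this is roughly how I produce the matrix and check the row sums. The documents below are placeholders for my real corpus, and depending on the sklearn version the topic-count parameter is called n_topics or n_components:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus standing in for my real documents
docs = ["topic modelling example text",
        "another short document",
        "more sample text here"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # each row is a topic distribution

# Report rows whose sum is not bitwise-exactly 1.0
for i, s in enumerate(doc_topic.sum(axis=1)):
    if s != 1.0:
        print(i, repr(s))
```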
This causes problems in the next steps of my project. Specifically, the 2D array is wrapped in a pandas DataFrame and stored as a .csv file. An R script then loads the matrix from the .csv file and computes the total variation distance between pairs of rows using the package function distrEx::TotalVarDist(), which sums each distribution and raises an error if the sum is not exactly 1.0. So I need sum(row) == 1.0 for every row.
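The hand-off between the two scripts looks roughly like this; the file name and the matrix below are stand-ins for my real data:

```python
import numpy as np
import pandas as pd

# Stand-in for the LDA output above: each row is (approximately) a distribution
doc_topic = np.random.default_rng(0).dirichlet(np.ones(10), size=20)

# Save for the R script; "doc_topic.csv" is just a placeholder name
pd.DataFrame(doc_topic).to_csv("doc_topic.csv", index=False)

# The CSV round trip preserves the float values, so the tiny deviations
# from 1.0 are still there when R reads the file
reloaded = pd.read_csv("doc_topic.csv").to_numpy()
print((reloaded.sum(axis=1) == 1.0).all())  # often False
```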
How can I ensure that all rows add up to exactly 1.0?
Given this matrix, I could fix it by adding or subtracting the tiny error to or from the first number in each row, but that is obviously bad practice.
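To make that concrete, the ad-hoc patch I have in mind is something like this (the function name is made up, and I would rather not do it this way):

```python
import numpy as np

def absorb_residual_into_first_entry(dist):
    """Add each row's residual (1.0 minus the row sum) to that row's
    first entry so the row sums to 1.0 again."""
    fixed = dist.copy()
    fixed[:, 0] += 1.0 - fixed.sum(axis=1)
    return fixed

# Example with a stand-in matrix; the patched sums are usually exactly 1.0,
# although float addition can in principle still leave a stray ULP
rows = np.random.default_rng(0).dirichlet(np.ones(5), size=4)
print((absorb_residual_into_first_entry(rows).sum(axis=1) == 1.0).all())
```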
How can I fix this properly?