Fixing probabilities, which do not sum to 1 in the matrix of words

Question

I created a matrix using the answers to these questions: question 1 and question 2. Similar questions about this error did not help me resolve it.

But the probabilities exceed 1, and I get `ValueError: probabilities do not sum to 1`.

Please let me know how I can share a piece of the df with you for reproducibility.

I generated the co-occurrence matrix using this code:

import pandas as pd

# Create matrix
my_df = pd.DataFrame(0, columns = words, index = words)
for k,v in frequency_list.items():
    # k is a (row word, column word) pair, v is its count
    my_df.at[k[0],k[1]] = v

which gives me a 10000 x 10000 matrix.

Then I converted the counts into probabilities by dividing each row by its sum:

row_sums = my_df.values.sum(axis = 1)
row_sums[row_sums == 0] = 1
my_prob = my_df/row_sums.reshape((-1,1)) 
my_prob

When I print the sums for the last 30 words

my_prob.sum().tail(30)

I get a value above 1:

“thy               0.000000
“till              0.002538
**“to              1.109681**
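
One thing worth checking at this point: by default `DataFrame.sum()` adds down each column (`axis=0`), whereas the normalization above divides each row by its own total. The sketch below uses a hypothetical three-word toy matrix (the words and counts are invented for illustration, and `counts`, `probs`, `col_totals`, `row_totals` are illustrative names) to show that row sums come out as 1 while column sums need not. It uses `DataFrame.div(..., axis=0)`, which should be equivalent to the reshape-and-divide step in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: (first_word, second_word) -> count
words = ["the", "to", "cat"]
frequency_list = {
    ("the", "cat"): 3,
    ("the", "to"): 1,
    ("to", "the"): 2,
    ("cat", "to"): 4,
}

# Build the count matrix the same way as in the question
counts = pd.DataFrame(0, columns=words, index=words)
for (first, second), n in frequency_list.items():
    counts.at[first, second] = n

# Normalize each ROW by its total (all-zero rows are left as zeros)
row_sums = counts.values.sum(axis=1)
row_sums[row_sums == 0] = 1
probs = counts.div(row_sums, axis=0)

col_totals = probs.sum()        # default axis=0: one total per column
row_totals = probs.sum(axis=1)  # one total per row

print(col_totals)  # column totals are not constrained to be 1
print(row_totals)  # each non-empty row totals 1
```

If the rows are meant to be the probability distributions, then `my_prob.sum(axis=1)` and `my_prob.loc['the']` (a row) would be the quantities to inspect, rather than column sums or a column such as `my_prob['the']`.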

I tried to normalize.

Pick the word "the" and generate a list:

word_the = my_string_prob['the'].tolist()

Try to normalize the probabilities:

sum_of_elements = sum(word_the)
a = 1/sum_of_elements
my_probs_scaled = [e*a for e in word_the]
my_probs_scaled
sum(my_probs_scaled)
### Output 1.000000000000005
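
A result such as 1.000000000000005 differs from 1 only by floating-point rounding, so an exact comparison with 1 is usually the wrong test. Below is a minimal sketch, with made-up values standing in for `word_the`, of normalizing with NumPy and checking the sum against a tolerance. It assumes the vector is later passed as `p` to `numpy.random.Generator.choice`; that function is understood to apply a small tolerance of its own when validating `p`, but treat that as an assumption to confirm for your NumPy version.

```python
import math
import numpy as np

# Hypothetical weights standing in for word_the
word_the = [9.0, 4.0, 5.0, 4.0, 0.0, 7.0]

p = np.asarray(word_the, dtype=float)
p = p / p.sum()  # rescale so the entries sum to (approximately) 1

total = float(p.sum())
# Compare with a tolerance instead of testing total == 1.0
sums_to_one = math.isclose(total, 1.0, rel_tol=1e-9)
print(total, sums_to_one)

# Draw one index according to p (seeded only to make the sketch repeatable)
rng = np.random.default_rng(0)
idx = rng.choice(len(p), p=p)
print(idx)
```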

This code worked on the smaller, simpler matrix from one of the questions above. Thanks!

You can use `from decimal import Decimal as D` to avoid floating point errors – Parth Shah, Jul 26 '20 at 18:59
@ParthShah, thanks! Any tips on where to use it in my code? – Anakin Skywalker, Jul 26 '20 at 19:01

Parth Shah · Accepted Answer · 2020-07-26T19:37:18.760

You can control the precision of your numbers using the `decimal` module in Python. Consider the following example:

from decimal import Decimal as D
from decimal import getcontext
getcontext().prec = 8

word_the = [9, 4, 5, 4]
sum_of_elements = sum(word_the)
a = D(1/sum_of_elements)
my_probs_scaled = [D(e)*a for e in word_the]
print(my_probs_scaled)
print(sum(my_probs_scaled))

And the output is:

[Decimal('0.40909091'), Decimal('0.18181818'), Decimal('0.22727273'), Decimal('0.18181818')]
1.0000000

You can play around with the parameters, including the precision.
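
As a rough sketch of what "playing around with the precision" could look like, the snippet below wraps the same computation in a hypothetical helper, `scaled_sum` (not part of the answer), and uses `decimal.localcontext()` so that the global context is left untouched. It deliberately keeps the answer's construction `Decimal(1 / sum(values))`, which computes the reciprocal as a float first.

```python
from decimal import Decimal, localcontext

def scaled_sum(values, prec):
    """Scale values to sum to one using Decimal arithmetic at the given precision."""
    with localcontext() as ctx:
        ctx.prec = prec  # number of significant digits kept by each operation
        # Same construction as in the answer: float reciprocal, then Decimal
        a = Decimal(1 / sum(values))
        scaled = [Decimal(v) * a for v in values]
        return scaled, sum(scaled)

word_the = [9, 4, 5, 4]
for prec in (6, 8, 28):
    scaled, total = scaled_sum(word_the, prec)
    print(prec, total)
```

Note that a lower precision rounds the error away rather than removing it, so whether the total prints as exactly 1 can depend on both the data and the chosen precision.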

edited Jul 26 '20 at 19:37

answered Jul 26 '20 at 19:12

Parth Shah

TypeError: unsupported operand type(s) for *: 'float' and 'decimal.Decimal' – Anakin Skywalker Jul 26 '20 at 19:30
Works on my machine. Strange. Editing it, you can try again. – Parth Shah Jul 26 '20 at 19:35
Decimal('0E-55'), Decimal('0E-55'), Decimal('0E-55'), Decimal('0E-55'), Decimal('0E-55'), Decimal('0E-55'), Decimal('0.0030873908')] 1.0000004 - as you can see, worked this time, but still above 1 :( – Anakin Skywalker Jul 26 '20 at 19:43
Yes, that's because the precision is 8. If you lower it to, say, 6, does it work? – Parth Shah Jul 26 '20 at 19:48
