I have a string that may consist of x different elements, and I need to measure how diverse those elements are.
To calculate the ideal entropy of the string (in bits), that is, the entropy of the most "diverse" string possible (one in which each of the x elements appears exactly once), I use the code below:
import math

ideal = 'abcefghijk'  # x = 10 elements, all different
# one probability per distinct character (dict.fromkeys keeps one key per char)
probid = [ideal.count(c) / len(ideal) for c in dict.fromkeys(ideal)]
# Shannon entropy in bits
entropy_ideal = -sum(p * math.log2(p) for p in probid)
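Since each of the x elements appears exactly once, this ideal entropy should simply equal log2(x); a quick sanity check (this assert is my addition):

assert abs(entropy_ideal - math.log2(len(ideal))) < 1e-9  # log2(10) ≈ 3.3219 bits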
Then I take the string I need to compare against this "ideal" diversity, calculate its entropy in the same way, and divide it by the ideal entropy to get a diversity index for that distribution:
string = 'abccbbbbcc'
prob = [string.count(c) / len(string) for c in dict.fromkeys(string)]
entropy = -sum(p * math.log2(p) for p in prob)
# diversity index: 1.0 means "as diverse as the ideal string"
index = entropy / entropy_ideal
print(index)
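For reference, here is the same computation collapsed into one helper function (just a sketch; normalized_diversity and its x parameter are my own naming):

import math
from collections import Counter

def normalized_diversity(string, x):
    """Shannon entropy of `string` in bits, divided by log2(x), the
    entropy of an ideal string of x distinct elements (assumes x > 1)."""
    n = len(string)
    probs = [count / n for count in Counter(string).values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(x)

print(normalized_diversity('abccbbbbcc', 10))  # ≈ 0.4097, same index as above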
I need to categorize that index as "diverse" or "not diverse", but I'm finding it difficult because the values vary with the length of the string, so one fixed cutoff does not work.
Do you have any suggestions on how I could amend the code, or is there an existing Python package that can do what I need?
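(For what it's worth, I imagine a package-based version might look like the sketch below, using scipy.stats.entropy, which normalizes raw counts to probabilities itself; scipy_index is my own name, and I'm not sure this solves the thresholding problem:)

import math
from collections import Counter
from scipy.stats import entropy

def scipy_index(string, x):
    counts = list(Counter(string).values())  # raw count per distinct character
    return entropy(counts, base=2) / math.log2(x)

print(scipy_index('abccbbbbcc', 10))  # ≈ 0.4097, matches the manual version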
UPDATE
For example, for
string = 'ccca'
ideal = 'abcd'
I get
0.8112781244591328 # entropy of the string
0.4056390622295664 # index (entropy / entropy_ideal)
While for
string = 'caaaav'
ideal = 'abcdef'
I get
1.2516291673878228 # entropy of the string
0.4841962570206112 # index (entropy / entropy_ideal)
But intuitively it seems to me that the second string is only slightly more diverse than the first one (I would still label it as low diversity).
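For concreteness, both examples reproduced with the normalized_diversity helper sketched above:

print(normalized_diversity('ccca', 4))    # 0.4056390622295664
print(normalized_diversity('caaaav', 6))  # 0.4841962570206112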