I have a string that may consist of x different elements, and I need to measure how diverse those elements are.
To calculate the ideal entropy of the string (in bits), that is, the entropy of the most "diverse" string possible (one in which each of the x elements appears exactly once), I use the code below:
import math

ideal = 'abcefghijk'  # x = 10 elements, all different
# one probability per distinct character (dict.fromkeys keeps one key per char)
probid = [ideal.count(c) / len(ideal) for c in dict.fromkeys(ideal)]
# Shannon entropy in bits
entropy_ideal = -sum(p * math.log2(p) for p in probid)
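Since each of the x elements appears exactly once, this ideal entropy should simply equal log2(x); a quick sanity check (this assert is my addition):

assert abs(entropy_ideal - math.log2(len(ideal))) < 1e-9  # log2(10) ≈ 3.3219 bits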
Then I take the string I need to compare against this "ideal" diversity, calculate its entropy in the same way, and divide it by the ideal entropy to get a diversity index for that distribution:
string = 'abccbbbbcc'
prob = [string.count(c) / len(string) for c in dict.fromkeys(string)]
entropy = -sum(p * math.log2(p) for p in prob)
# diversity index: 1.0 means "as diverse as the ideal string"
index = entropy / entropy_ideal
print(index)
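For reference, here is the same computation collapsed into one helper function (just a sketch; normalized_diversity and its x parameter are my own naming):

import math
from collections import Counter

def normalized_diversity(string, x):
    """Shannon entropy of `string` in bits, divided by log2(x), the
    entropy of an ideal string of x distinct elements (assumes x > 1)."""
    n = len(string)
    probs = [count / n for count in Counter(string).values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(x)

print(normalized_diversity('abccbbbbcc', 10))  # ≈ 0.4097, same index as above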
I need to categorize that index as "diverse" or "not diverse", but I'm finding it difficult because the values vary with the length of the string, so one fixed cutoff does not work.
Do you have any suggestions on how I could amend the code, or is there an existing Python package that can do what I need?
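(For what it's worth, I imagine a package-based version might look like the sketch below, using scipy.stats.entropy, which normalizes raw counts to probabilities itself; scipy_index is my own name, and I'm not sure this solves the thresholding problem:)

import math
from collections import Counter
from scipy.stats import entropy

def scipy_index(string, x):
    counts = list(Counter(string).values())  # raw count per distinct character
    return entropy(counts, base=2) / math.log2(x)

print(scipy_index('abccbbbbcc', 10))  # ≈ 0.4097, matches the manual version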
UPDATE
For example, for
string = 'ccca'
ideal = 'abcd'
I get
0.8112781244591328 # entropy of the string
0.4056390622295664 # index (entropy / entropy_ideal)
While for
string = 'caaaav'
ideal = 'abcdef'
I get
1.2516291673878228 # entropy of the string
0.4841962570206112 # index (entropy / entropy_ideal)
But intuitively it seems to me that the second string is only slightly more diverse than the first one (I would still label it as low diversity).
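For concreteness, both examples reproduced with the normalized_diversity helper sketched above:

print(normalized_diversity('ccca', 4))    # 0.4056390622295664
print(normalized_diversity('caaaav', 6))  # 0.4841962570206112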