0

I have a task where I have a list of certain values: l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]. I have a formula for computing a kind of probability on this list as the following (the probability is high in case there is many different values in the list and low if there are few kind of values):

$ p = - \sum_{i=1}^m f_i log_m f_i $

where m is the length of the list, $f_i$ is the frequency of the ith element of the list.

I want to code this in Python with the following:

from math import log
from collections import Counter

-sum([loc*log(loc, len(set(l))) for loc in Counter(l).values()])

But I somehow suspect that this is not the right way. Any better idea? Additionally: I do not understand the negative sign in the formula, what is the explanation of this?

Fredrik
  • 411
  • 1
  • 3
  • 14
  • Would you mind formatting your text so the formulas show up? – tchar Aug 18 '21 at 13:44
  • I tried, as you see, but Latex formulas don't show. – Fredrik Aug 18 '21 at 13:45
  • @tchar afaik SO doesn't have latex formatting and the feature request has been declined before – h4z3 Aug 18 '21 at 13:46
  • I think you mean that `m` is the length of your alphabet, right? Also are you sure its log_m in the equation? – NMme Aug 18 '21 at 13:49
  • The code itself (assuming the formula is right, I'm not gonna question it - I see it part of the given task) seems fine. I can only see two things: 1. Remove [] to make it work on generator (can make a difference in case of very long iterable in the comprehension). 2. `len(set(l))` gets re-calculated with every iteration - you can save the len before the loop (also will make a difference in looong iterables). Otherwise everything seems pretty good - nice use of Counter and set to get uniques! – h4z3 Aug 18 '21 at 13:49
  • @NMme: yes, log_m is in the formula this way. – Fredrik Aug 18 '21 at 13:51
  • 1
    Just to mention the elephant in the room and not if len(set(l)) or len(Counter) is O(n) or O(1) depending on python version or python iplementation, your implementation does not follow the formula you have provided. In your formula you loop every element of the list (even if that is encountered twice) and take the log with base the length of the list. In your implementation you loop for each value of the Counter, meaning the occurence of each distinct element of the list and not each element. You have to decide which one is correct. – tchar Aug 18 '21 at 15:03

2 Answers2

0

Here an alternative way to calculate the Entropy of the list using numpy:

import numpy as np

arr = np.array(l)
elem, c = np.unique(arr, return_counts=True)

# occurrences to probabilities
pc = c / c.sum()

# calculate the entropy (and account for log_m)
entropy = -np.sum(pc * np.log(pc)) * (1/np.log(len(c)))
NMme
  • 461
  • 3
  • 12
0

Although the numpy array is a better solution, in case you don't want to use numpy:

  1. You would be better if you saved the counter and use len(Counter) instead of len(set(l)), so that you don't recalculate in in every iteration. len(Counter) is the same as len(set(l)), but does not get recalculated in every iteration (I assume you use cpython3.x )
  2. If you don't get the desired result, then probably your formula is wrong
  3. In your code you use len(set(l)) and not len(l) and you iterate over the frequencies, not the list which is not what you describe in your formula.
  4. You don't need to wrap the expression inside sum within a list since you only need to iterate over it once (Generator expressions vs. list comprehensions)

EDIT: As to why you get a negative result, this is expected You sum over f[i] * log(f[i]) >= 0

  • f[i] >= 1: The frequency of ith element of the list
  • log(f[i]) >= 0 because f[i] >= 1: The log of each frequency in any base (base doesn't matter).

And then take the negative of that. The result will always be less that or equal to 0.

from math import log
from collections import Counter

l = ["alpha", "beta", "beta", "alpha", "gamma", "alpha", "alpha"]
f = Counter(l)
# This is from your code
p1 = -sum(f[e] * log(f[e], len(f)) for e in f)
# This is from your formula
p2 = -sum(f[e] * log(f[e], len(l)) for e in l)
print(p1, p2)

tchar
  • 838
  • 9
  • 12
  • The OP's code is already the same thing, but as one-liner with comprehension rather than multi-liner with map. ;) – h4z3 Aug 18 '21 at 13:54
  • But the formula/code for a list of few hundred elements with 3-4 unique values gives result as -47048.92377551. What kind of "probability" it is? – Fredrik Aug 18 '21 at 13:55
  • @h4z3 Yeah map is better, than comprehension, because it does not recalculate things without needing them and does not create lists without needing them. I don't understand the "one-liner" desease, but whatever. Also I like map due to its chainability with other map and filter – tchar Aug 18 '21 at 13:56
  • @tchar if you use sum with generator comprehension instead of list comprehension, it's also lazy. So basically 2 character difference (I already recommended and _explained_ (you know, we're here to explain stuff, not just give code without anything) the difference in the comments to the question itself) makes the code the same to yours. If you were that much for saving, you'd also save len as you re-calculate it with each map element (which I also mentioned in the comments to the question). – h4z3 Aug 18 '21 at 14:09
  • @tchar `len(frequencies)` is executed with each run of map because it's a call, you didn't save the value. Now: depending on container's implementation and its implementation, it *might* be O(1) - but doesn't have to be. ;) Every time someone complains about "efficiency" is such code snippets, I try to check if they are *sure* stuff like that works like they think. ;) – h4z3 Aug 18 '21 at 14:20
  • @tchar Btw, you didn't reply to my comment how on SO we try to *explain* *why* we do stuff, and not just give code. OP's question wasn't "how to do that", it was "is my code okay", so with such a big change explanation should be the main part of the answer. – h4z3 Aug 18 '21 at 14:20
  • @h4z3 AFAIK Counter subclasses dict and len(dict) is O(1) .Regardless of that, the original point was that oneliner + comprehension > map + multiline, in which I disagree, because `s = map(str.strip, s);s = filter(None, s)` will always look cleaner than `s = (ss.strip() for ss in s if ss.strip() != '')` which restrips the ss. And also I don't have to do weird formatting to satisfy PEP due to that. As far as the Original question. I read the title and the formula and gave a code that answers the title. My bad for not reading though it all, guess I am too bored. – tchar Aug 18 '21 at 14:24
  • @fredrik This is normal. Each sum's term is a multiplication of a frequency f[x]>= 1 with the log(f[x], y) where f[x] >= 1. So log(f[x], y) >= 0, meaning that each term in the sum is >= 0. Taking the negative of that, will give you p <= 0, assuming I did not make any mistake. – tchar Aug 18 '21 at 15:11