5

The product of the items in a list can easily be converted to a sum of their logarithms, provided the list contains no zeros, e.g.:

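>>> import math
>>> from functools import reduce  # built in on Python 2, must be imported on Python 3
>>> from operator import mul
>>> pn = [0.4, 0.3, 0.2, 0.1]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.22133638394006433
>>> math.exp(sum(0.25 * math.log(p) for p in pn))
0.22133638394006436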

How should we handle lists that contain zeros in Python, in a way that is both programmatically and mathematically correct?

More specifically, how should we handle cases like:

>>> pn = [0.4, 0.3, 0, 0]
>>> math.pow(reduce(mul, pn, 1), 1./len(pn))
0.0
>>> math.exp(sum(1./len(pn) * math.log(p) for p in pn))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <genexpr>
ValueError: math domain error

Is returning 0 really the right way to handle this? Is there an elegant solution that takes the zeros in the list into account without collapsing the result to 0?

This is a kind of geometric average (a product over the list), and it is not very useful if it returns 0 just because the list contains a single 0.

Spillover from Math Stack Exchange: https://math.stackexchange.com/questions/1727497/resolving-zeros-in-product-of-items-in-list. There was no answer from the math people, so maybe the Python/code Jedis have better ideas for resolving this.

alvas
  • You might try to use [L'Hôpital's rule](https://en.wikipedia.org/wiki/Division_by_zero) when you divide by zero. If the case is an exponent, I'm not sure L'Hôpital will actually help you, but you can try it. – Lior Magen Apr 11 '16 at 10:23
  • Surely the right way is to return a NaN. [There are NaNs in Python](http://stackoverflow.com/questions/944700/how-to-check-for-nan-in-python), and `numpy` already excludes `nan`s in its computations. – Akshat Mahajan Apr 13 '16 at 03:54

4 Answers

6

TL;DR: Yes, returning 0 is the only right way. (But see Conclusion.)

Mathematical background

In real analysis (i.e. not for complex numbers), the domain of log is traditionally taken to be the positive real numbers. We have the identity:

x = exp(log(x)),   for x>0.

This identity extends naturally to x=0, since the limit of the right-hand side as x->0+ is well defined and equal to 0. Moreover, it is legitimate to set log(0)=-inf and exp(-inf)=0 (again: only for real, not complex, numbers). Formally, we extend the set of real numbers by adding two elements, -inf and +inf, and defining a consistent arithmetic on them. (For our purposes, we need inf + x = inf for real x, x * inf = inf for real x > 0, inf + inf = inf, etc.)

The other identity x = log(exp(x)) is less troublesome and holds for all real numbers (and even x=-inf or +inf).

Geometric mean

The geometric mean can be defined for nonnegative numbers (possibly equal to zero). For two numbers a, b (it generalizes naturally to more numbers, so I'll use only two from here on), it is

gm(a,b) = sqrt(a*b),   for a,b >= 0.

Certainly, gm(0,b)=0. Taking log, we get:

log(gm(a,b)) = (log(a) + log(b))/2

and this remains well defined even if a or b is zero. (We can plug in log(0) = -inf and the identity still holds thanks to the extended arithmetic we defined earlier.)
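A minimal sketch of this extended arithmetic in Python, using only the standard library. The helper names `safe_log` and `geometric_mean` are illustrative and not taken from any library, and the inputs are assumed to be nonnegative. Because `math.log(0)` raises `ValueError` instead of returning -inf, the zero case is mapped by hand:

```python
import math

def safe_log(x):
    """Logarithm extended to x == 0 by setting log(0) = -inf (hypothetical helper)."""
    if x < 0:
        raise ValueError("only nonnegative numbers are supported")
    return math.log(x) if x > 0 else float("-inf")

def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(x))), with zeros allowed."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    # Any zero makes the sum -inf, and -inf divided by n is still -inf.
    mean_log = sum(safe_log(v) for v in values) / len(values)
    # math.exp(-inf) evaluates to 0.0, matching gm(0, b) = 0.
    return math.exp(mean_log)

print(geometric_mean([0.4, 0.3, 0.2, 0.1]))  # about 0.2213
print(geometric_mean([0.4, 0.3, 0, 0]))      # 0.0
```

With this convention the log-sum form agrees with the direct product form on lists that contain zeros, instead of raising a domain error.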

Interpretation

Not surprisingly, the notion of the geometric mean hails from geometry and was originally (in ancient Greece) used for strictly positive numbers.

Suppose we have a rectangle with sides of lengths a and b, and we want a square whose area equals the area of the rectangle. It is easy to see that the side of that square is the geometric mean of a and b.

Now, if we take a = 0, we don't really have a rectangle and this geometric interpretation breaks down. Similar problems can arise with other interpretations. We can mitigate this by considering, for example, degenerate rectangles and squares, but that may not always be a plausible approach.

Conclusion

It's up to the user (mathematician, engineer, programmer) how she interprets a geometric mean of zero. If that value causes serious problems with the interpretation of the results or breaks a computer program, then perhaps the geometric mean was not the right mathematical model in the first place.


Python

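As already mentioned in the other answers, Python has infinity implemented. NumPy emits a RuntimeWarning (divide by zero encountered in log) when executing np.exp(np.log(0)), but the result of the operation, 0.0, is correct. Note that the standard library's math.log(0) raises ValueError instead of returning -inf.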

ptrj
  • If you don't like the result 0, then you may want to consider a [Hölder mean](https://en.wikipedia.org/wiki/Generalized_mean). Can elaborate on it later. – ptrj Apr 08 '16 at 20:10
2

Whether or not 0 is the correct result depends on what you're trying to accomplish. ptrj did a great job with their answer, so I will only add one thing to consider.

You may want to consider using an epsilon-adjusted geometric mean. Whereas a standard geometric mean is of the form (a_1*a_2*...*a_n)^(1/n), the epsilon-adjusted geometric mean is of the form ( (a_1+e)*(a_2+e)*...*(a_n+e) )^(1/n) - e. The appropriate value for epsilon (e) depends again on your task.

Epsilon-adjusted geometric means are sometimes used in data retrieval where a 0 in the set shouldn't cause a record's score to vanish entirely, though it should still penalize the record's score. See for example Score Aggregation Techniques in Retrieval Experimentation.

For example, with your data and an epsilon adjustment of 0.01:

>>> from functools import reduce  # built in on Python 2, must be imported on Python 3
>>> from operator import mul
>>> pn=[0.4, 0.3, 0, 0]
>>> e=0.01
>>> pow(reduce(mul, [x+e for x in pn], 1), 1./len(pn)) - e
0.04970853116594962
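
The same adjustment can also be computed in log space, which ties it back to the sum-of-logs form from the question. This is only a sketch: the function name `eps_geometric_mean` is made up for illustration, and `eps=0.01` is simply the value used in the session above, not a recommended default.

```python
import math

def eps_geometric_mean(values, eps=0.01):
    """Epsilon-adjusted geometric mean: (prod(x + eps)) ** (1/n) - eps.

    Assumes nonnegative inputs; eps must be positive so that
    log(x + eps) is defined when x == 0.
    """
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    if eps <= 0:
        raise ValueError("eps must be positive")
    # Average the logs of the shifted values, then undo the shift.
    mean_log = sum(math.log(v + eps) for v in values) / len(values)
    return math.exp(mean_log) - eps

print(eps_geometric_mean([0.4, 0.3, 0, 0]))  # roughly 0.0497
```

Since every shifted value is strictly positive, `math.log` never sees a zero and no special casing is needed.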
Gabriel
0

You should return -math.inf in Python 3.5 or later, or -float('inf') in older versions. This is because the logarithm of a number goes to negative infinity as the number approaches 0 from above. This float value will preserve the correct ordering between the sums of logs of different lists; for instance, one would expect that

sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1])

This inequality holds if you return negative infinity.
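
A possible sketch of such a `sumlog`, assuming nonnegative inputs (the name comes from the inequality above; the implementation is only one way to write it):

```python
import math

def sumlog(values):
    """Sum of natural logs, returning -inf as soon as a zero is present."""
    total = 0.0
    for v in values:
        if v == 0:
            # log(0) is treated as -inf, which dominates any finite sum.
            return float("-inf")
        total += math.log(v)
    return total

print(sumlog([5, 4, 1, 0, 2]) < sumlog([5, 1, 4, 0.0001, 1]))  # True
```

One consequence of this choice is that any two lists containing a zero compare as equal, since both sums are negative infinity.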

0

You can try using list comprehensions in Python. They can be very powerful for customising the way your data is handled. This example uses a list comprehension with a sentinel value of -999 in place of the undefined logarithms.

>>> [math.log(i) if i > 0 else -999 for i in pn]
[-0.916290731874155, -1.2039728043259361, -999, -999]

If you're only using the if and not the else, then the if goes after the for i in pn part.
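
As a small illustration of that filter form (the variable name `logs_of_positive` is just for this example), the zeros are dropped rather than replaced:

```python
import math

pn = [0.4, 0.3, 0, 0]

# Filter form: the condition follows the `for` clause, so items that
# fail it are skipped instead of being mapped to a sentinel.
logs_of_positive = [math.log(p) for p in pn if p > 0]

print(len(logs_of_positive))  # 2
```

Keep in mind that dropping the zeros changes what is being averaged, so whether this is appropriate depends on the task.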

Onjrew