
This is not a question about how to calculate averages in Python, but a question of how to balance precision and speed when comparing the means of two lists of numbers.

This problem was framed in terms of students' grades, so 'typical' inputs to compare looked like [98, 34, 80] and [87, 65, 90, 87]. However, I came up against test cases that clearly involved very large numbers, as I was occasionally getting an OverflowError on a return float(average).

There are test cases like the following, for which using float() returns the incorrect answer:

x = [9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
     9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
     9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999]
y = [9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
     9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
     9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998]

The averages of x and y are very close, but not equal. From what I can see the only way to get the right answer is to use Decimal or Fraction, but these are slower.
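For example (a quick check using the x and y above; Fraction comes from the standard fractions module):

from fractions import Fraction

print(sum(x) / len(x) == sum(y) / len(y))                    # True  -- float arithmetic says the means are equal (wrong)
print(Fraction(sum(x), len(x)) == Fraction(sum(y), len(y)))  # False -- exact arithmetic says they differ (right)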


Here's a quick performance analysis.

from decimal import Decimal
from fractions import Fraction
import time

# Minimal stand-in for the @timeit decorator used below (its definition
# wasn't shown in the question); it just prints wall-clock time.
def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print('took: %.4f sec' % (time.perf_counter() - start))
        return result
    return wrapper

def mean_fractions(nums):
    return Fraction(sum(nums), max(len(nums), 1))

def mean_builtins(nums):
    return sum(nums) / float(max(len(nums), 1))

def mean_decimal(nums):
    return Decimal(sum(nums)) / max(len(nums), 1)

# test runner
@timeit
def do_it(func, nums, times):
    for _ in range(times):
        func(nums)

do_it(mean_builtins, y, 1000000)   # took: 0.9550 sec
do_it(mean_decimal, y, 1000000)    # took: 3.0867 sec
do_it(mean_fractions, y, 1000000)  # took: 3.2718 sec

do_it(mean_builtins, [96, 43, 88], 1000000)   # took: 0.7679 sec
do_it(mean_decimal, [96, 43, 88], 1000000)    # took: 1.4871 sec
do_it(mean_fractions, [96, 43, 88], 1000000)  # took: 2.6341 sec

We can see that using the builtins offers a significant speed-up, even before accounting for the fact that if you want the final result to be a float you also need to convert the Decimal and Fraction objects.
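For completeness, that conversion is just a call on the result (it is not included in the timings above):

float(mean_fractions(y))  # Fraction -> float
float(mean_decimal(y))    # Decimal -> float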

Question

So my question is: given these speed differences, is there a good way to know when the builtins approach will suffice for some lists a and b, and when it will give the wrong answer? On the x and y above it says they're equal, which is wrong, but on [96, 43, 88] and [87, 50] it works fine.

thundergolfer
  • Do you actually have to deal with this? It would be considered unusual and specialized for any statistical software to even bother with this. Who even has measurements accurate to 112 decimal places, anyway? – user2357112 Jul 29 '17 at 01:42
  • Also, the average of 0 numbers is not 0. There is no average of a sample of 0 numbers, and a ZeroDivisionError would be an entirely appropriate result if you tried to take that average. – user2357112 Jul 29 '17 at 01:51
  • What are you working on, where does this come from? – Stefan Pochmann Jul 29 '17 at 02:24
  • Floats are only useful for values that can be held in a float (up to plus or minus 2^24-1). But here, why don't you use integers to calculate the sum and only resort to float for the final average? – Rudy Velthuis Jul 29 '17 at 09:36
  • @RudyVelthuis No, plus/minus 2 ^ **53**. And isn't your suggestion what their `return float(average)` does? Or what their `mean_builtins` does? – Stefan Pochmann Jul 29 '17 at 11:01
  • @Stefan: Ah yes, I forgot Python floats are double precision. – Rudy Velthuis Jul 29 '17 at 12:08
  • @StefanPochmann this is a Goldman Sachs interview question – thundergolfer Jul 30 '17 at 00:32
  • @user2357112 I had to deal with it in a Goldman Sachs issued Hackerrank question. – thundergolfer Jul 30 '17 at 00:35
  • @thundergolfer Is it publicly available at HackerRank? Or just for those interviewees? – Stefan Pochmann Jul 30 '17 at 00:36
  • @StefanPochmann Not publicly available I'm afraid. Input was list of tuples of (student_name, grade) and output had to be the best average grade rounded down to nearest integer. All the complexity from my perspective was in the floating point arithmetic accuracy issues. – thundergolfer Jul 30 '17 at 00:41
  • @thundergolfer Um... uh... if you shall round down to nearest integer anyway... then why are you getting floats involved *at all*? – Stefan Pochmann Jul 30 '17 at 01:08
  • @thundergolfer I mean, why not just `sum(nums) // max(len(nums), 1)`? – Stefan Pochmann Jul 30 '17 at 01:32
  • @StefanPochmann it is wrong for the `x`, `y` comparison case above. – thundergolfer Jul 30 '17 at 01:46
  • Ah, ok. So while you shall report rounded averages, they're not what counts. Then for the comparison, you don't even need to compute the averages. Instead of comparing sum1/len1 with sum2/len2, you could compare sum1*len2 with sum2*len1. – Stefan Pochmann Jul 30 '17 at 02:19 (a sketch of this appears after these comments)
  • Where do you get these large numbers from? Possibly they are only uint64 values, or are some numbers really larger than 2^64? Do you only have unsigned integer numbers here? Your implementation (summing up and then dividing by len(numbers)) is far from optimal. Instead you have to calculate the integer division and modulus for each number, then sum up the integers and add sum(moduli)/len(nums). – max9111 Jul 31 '17 at 17:09
  • @max9111 sorry, can you explain this further: " Instead you have to calculate the integer division and modulus for each number and then summing up the integers and add sum(moduli)/len(nums)." What kind of modulo operation? How is this an average calculation. – thundergolfer Aug 01 '17 at 01:50
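A minimal sketch of the exact comparison Stefan Pochmann suggests in the comments above (the helper name is mine, not from the thread; it assumes both lists are non-empty):

def means_equal(a, b):
    # mean(a) == mean(b)  <=>  sum(a) * len(b) == sum(b) * len(a)
    # Cross-multiplying avoids the division entirely, and Python ints are exact.
    return sum(a) * len(b) == sum(b) * len(a)

print(means_equal(x, y))                    # False (correct)
print(means_equal([96, 43, 88], [87, 50]))  # False (correct)

The same idea works for ordering, since the lengths are positive: compare sum(a) * len(b) with sum(b) * len(a) using < or > instead of ==.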

1 Answer


Assume the original scores are always integers. A Python float is a 64-bit IEEE 754 floating-point number. It can exactly represent any integer with 15 or fewer digits in base 10; more precisely, it can represent every integer up to and including 2^53 = 9,007,199,254,740,992.

So if the sum of your scores is more than that, you are likely to have issues using float in the way you have outlined.
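A quick illustration of that limit (my own check, not from the original answer):

print(float(2**53) == float(2**53 + 1))  # True: 2**53 + 1 has no exact float representation
print(9007199254740992.0 + 1.0)          # 9007199254740992.0 -- the added 1 is lost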

You are also likely to have issues if, as Stefan Pochmann points out in the comments below, your sums are large but not quite that large:

6755399441055745.0 / 3 == 6755399441055746.0 / 3 # True

So you need to stay underneath the limit of 15 significant digits for the result of the division as well. If you divide a 15-digit number by 3, you "lose" one digit: the division may not reduce the number of digits in the integer part, yet it demands an extra digit for the fractional part. A single "spare" significant digit may therefore be enough, but even that may not suffice (I haven't tested it). In any case, you'd want to use a higher-precision type if the sum of scores is 1 quadrillion or higher.

John Zwinck
  • @AryaMcCarthy [My math is correct](http://www.wolframalpha.com/input/?i=-10%5E100+%3C+9,007,199,254,740,993). Do you think my "^" means xor instead of power, or did you overlook the minus sign? – Stefan Pochmann Jul 29 '17 at 03:33
  • @StefanPochmann: that is nitpicking. Obviously the range -9,007,199,254,740,993 up to 9,007,199,254,740,993 is meant. And in that range, -10^100 does not fit. – Rudy Velthuis Jul 29 '17 at 09:29
  • @StefanPochmann: obviously, a student with such a low score would be a total dimwit and could forget his study anyway. – Rudy Velthuis Jul 29 '17 at 09:30
  • @RudyVelthuis I know it's nitpicking, almost said that myself. Though I did have a slightly more important point, and you just made it. When you say "range -9,007,199,254,740,993 up to 9,007,199,254,740,993", that's wrong or at least misleading. Because that wording really suggests that -9,007,199,254,740,993 is included. Which is wrong, since that is **not** representable (just like 9,007,199,254,740,993 isn't). – Stefan Pochmann Jul 29 '17 at 10:19
  • I used the values posted. I did not check them. So it is `-9,007,199,254,740,992 up to and including 9,007,199,254,740,992`. These are the exactly representable integers. – Rudy Velthuis Jul 29 '17 at 12:13
  • @RudyVelthuis No, those are not "the" exactly representable integers. Because that means there are no others. Which is wrong, since there are many others. – Stefan Pochmann Jul 29 '17 at 15:28
  • They are exactly representable without "gaps". – Rudy Velthuis Jul 29 '17 at 16:53
  • @John Zwinck it doesn't seem to me an issue with the sum of the grades, but with how accurately Python's floating point numbers can represent the fractional representation of those grades' mean. Staying within that given range, I think you could find two sets of numbers for which floating point arithmetic would produce the wrong answer. – thundergolfer Jul 30 '17 at 00:38
  • @thundergolfer Yeah, I just remembered that you can have different fractions where both numerator and denominator are integers well within this range, but their `float` values are the same. For example `(1e9 / (1e9 + 1) == (1e9 + 1) / (1e9 + 2))` is true. – Stefan Pochmann Jul 30 '17 at 02:27
  • @StefanPochmann: Can you find an example where the denominators are in a reasonable range for OP? Say, less than 100,000? – John Zwinck Jul 30 '17 at 02:30
  • @JohnZwinck Hey, that takes just 20 GB or so, and even my PC has 32 GB RAM :-). But no, I don't yet know examples with much smaller denominators. I did write about [**the x/(x+1) == (x-1)/x case**](https://discuss.leetcode.com/topic/12877/20-line-c-o-n-2-hashing-solution/15) a while back, maybe that can help... – Stefan Pochmann Jul 30 '17 at 02:38
  • @JohnZwinck `6755399441055745.0 / 3 == 6755399441055746.0 / 3` – Stefan Pochmann Jul 30 '17 at 03:00