This is not a question about how to calculate averages in Python, but a question of how to balance precision and speed when comparing the means of two lists of numbers.
This problem was framed in terms of students' grades, so 'typical' inputs to compare were like [98, 34, 80] and [87, 65, 90, 87]. However, I came up against test cases that clearly involved very large numbers, as I was occasionally getting an OverflowError on a return float(average).
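As a minimal sketch of where that OverflowError comes from (assuming Python 3; the values below are illustrative stand-ins I picked, not the actual test inputs): an int beyond the float range cannot be converted at all, while a large int inside the range converts silently but loses its low-order digits.

```python
# Illustrative value only: any int above roughly 1.8e308 is outside the float range.
huge = 10 ** 400

try:
    float(huge)
    overflowed = False
except OverflowError:
    # Raised because the int is too large to represent as a float.
    overflowed = True

print(overflowed)

# A large int that still fits in the float range converts without an error,
# but neighbouring ints collapse onto the same float value.
big = 10 ** 100 - 1
same_float = float(big) == float(big + 1)
print(same_float)
```

So the error only appears for truly enormous inputs; the quieter problem is the loss of precision shown by the second check.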
There are test cases like the following, for which using float() returns the incorrect answer:
x = [9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999]
y = [9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999,
9999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999999998]
The averages of x and y are very close, but not equal. From what I can see, the only way to get the right answer is to use Decimal (with sufficient precision) or Fraction, but these are slower.
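To make the discrepancy concrete, here is a small sketch. It assumes Python 3, and it uses a shorter stand-in value (10**100 - 1, a run of nines) rather than the exact numbers above; the helper names mean_float and mean_exact are mine, chosen for illustration. It also suggests that Decimal only helps once the context precision is raised above its default of 28 significant digits, since the division is rounded to that precision.

```python
from decimal import Decimal, localcontext
from fractions import Fraction

nines = 10 ** 100 - 1          # stand-in for the long run of nines
x = [nines, nines, nines]
y = [nines, nines, nines - 1]

def mean_float(nums):
    # Float-based mean: the int sum is rounded to a float before dividing.
    return sum(nums) / float(max(len(nums), 1))

def mean_exact(nums):
    # Exact rational mean: no rounding at any step.
    return Fraction(sum(nums), max(len(nums), 1))

float_equal = mean_float(x) == mean_float(y)    # floats cannot tell them apart
exact_equal = mean_exact(x) == mean_exact(y)    # Fractions can

# Decimal division is rounded to the context precision.
with localcontext() as ctx:
    ctx.prec = 28                                # the default precision
    dec_equal_28 = Decimal(sum(x)) / 3 == Decimal(sum(y)) / 3
with localcontext() as ctx:
    ctx.prec = 120                               # enough digits for these inputs
    dec_equal_120 = Decimal(sum(x)) / 3 == Decimal(sum(y)) / 3

print(float_equal, exact_equal, dec_equal_28, dec_equal_120)

# On small inputs the float-based mean distinguishes the lists just fine.
small_differ = mean_float([96, 43, 88]) != mean_float([87, 50])
print(small_differ)
```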
Here's a quick performance analysis.
from decimal import Decimal
from fractions import Fraction

def mean_fractions(nums):
    return Fraction(sum(nums), max(len(nums), 1))

def mean_builtins(nums):
    return sum(nums) / float(max(len(nums), 1))

def mean_decimal(nums):
    return Decimal(sum(nums)) / max(len(nums), 1)

# test runner; timeit is a timing decorator defined elsewhere (not shown)
@timeit
def do_itt(func, input, times):
    for i in range(times):
        func(input)

do_itt(mean_builtins, y, 1000000)              # took: 0.9550 sec
do_itt(mean_decimal, y, 1000000)               # took: 3.0867 sec
do_itt(mean_fractions, y, 1000000)             # took: 3.2718 sec
do_itt(mean_builtins, [96, 43, 88], 1000000)   # took: 0.7679 sec
do_itt(mean_decimal, [96, 43, 88], 1000000)    # took: 1.4871 sec
do_itt(mean_fractions, [96, 43, 88], 1000000)  # took: 2.6341 sec
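The timeit decorator itself isn't shown above. A minimal stand-in, assuming it does nothing more than print the wall-clock time of the decorated call, might look like the sketch below. This is an assumption for illustration, not the exact decorator behind the timings above, and repeat_call is a hypothetical name for the runner.

```python
import time
from functools import wraps

def timeit(func):
    """Hypothetical timing decorator: prints how long each call took."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("took: %.4f sec" % elapsed)
        return result
    return wrapper

@timeit
def repeat_call(func, data, times):
    # Call func(data) the requested number of times; only the total is timed.
    for _ in range(times):
        func(data)

# Usage: time 1000 calls of a trivial function.
repeat_call(sum, [96, 43, 88], 1000)
```

The standard library's timeit module would be another option for measurements like these; it is a module rather than a decorator, so it would be called differently.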
We can see that using the builtins offers a significant speed-up (roughly 2 to 3.5 times faster in these runs), even ignoring that if you want the final result to be a float you need to convert the Decimal and Fraction objects.
Question
So my question is: given these speed differences, is there a good way to know when the builtins approach would suffice for some lists a and b, and when it would give the wrong answer? On the above x and y it says they're equal, which is wrong, but on [96, 43, 88] and [87, 50] it works fine.