
I have measured the running time of two equivalent ways of computing the mean of a NumPy array (1D and 2D) using the timeit module:

>>> setup = 'import numpy as np;a=np.random.randint(100, size=(100,100));b=np.random.randint(100, size=1000)'
>>> timeit.timeit(setup=setup, stmt='a.mean()')
13.513522000001103
>>> timeit.timeit(setup=setup, stmt='a.sum()/a.size')
6.080089200000657
>>> timeit.timeit(setup=setup, stmt='b.mean()')
5.404982399999426
>>> timeit.timeit(setup=setup, stmt='b.sum()/b.size')
2.261378399998648

Surprisingly, the numpy.ndarray.mean method is slower than numpy.ndarray.sum() / numpy.ndarray.size, regardless of the size of the array.

Can anybody explain this? Thanks in advance!

Gold_Leaf
  • Just guessing, it could be that `mean` performs the sums out of order to minimize the total error. – Mark Ransom Jul 05 '22 at 00:52
  • @MarkRansom Well, yes and no; both do a basic reduction. However, since `np.sum` and `np.mean` do not operate on the same data type internally here, the former does a SIMD-friendly naive integer reduction (a basic loop) while the latter does a SIMD-friendly pairwise floating-point reduction. In practice, both should be executed out of order. The algorithm is different and the latter is a bit less in-order. `np.sum` of floating-point items applies exactly the same reduction algorithm (pairwise summation); see the sketch after these comments. – Jérôme Richard Jul 05 '22 at 10:47
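
To illustrate the naive-loop versus pairwise-summation point from the comment above, here is a minimal sketch (made-up float32 values, not from the question) comparing a plain Python accumulation with `np.sum`:

    import numpy as np

    # Made-up data: one million float32 values of 0.1; the exact sum is ~100000.
    x = np.full(1_000_000, 0.1, dtype=np.float32)

    acc = np.float32(0.0)
    for v in x:          # naive in-order accumulation (slow, for illustration only)
        acc += v
    print(acc)           # drifts noticeably from 100000 due to rounding

    print(x.sum())       # np.sum uses pairwise summation: much closer to 100000
    print(x.mean())      # also a pairwise float32 reduction here: close to 0.1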

1 Answer


np.sum and np.mean do operate on different native data types internally. Indeed, np.mean converts all items to np.float64 internally and therefore creates an expensive new temporary array, while np.sum operates directly on the np.int32 integers (which can be computed more efficiently on mainstream x86-64 processors). To mimic the behaviour of np.mean, you can specify the accumulator/output type with b.sum(dtype=np.float64)/b.size. The resulting performance is much closer to that of np.mean. For more information about this, please consider reading this post and this one. Note that np.sum can suffer from overflows if the integers are big (resulting in completely wrong results, as opposed to np.mean).
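
Here is a minimal sketch of that check (reusing the b array from the question; the exact timings are machine-dependent, and the 2**62 values below are made up just to show the overflow caveat):

    import timeit
    import numpy as np

    setup = 'import numpy as np; b = np.random.randint(100, size=1000)'

    # Integer accumulation: fast, SIMD-friendly reduction on the native integer dtype.
    print(timeit.timeit(setup=setup, stmt='b.sum() / b.size'))

    # Forcing a float64 accumulator mimics what np.mean does internally,
    # so this timing should land much closer to the one of b.mean().
    print(timeit.timeit(setup=setup, stmt='b.sum(dtype=np.float64) / b.size'))
    print(timeit.timeit(setup=setup, stmt='b.mean()'))

    # Overflow caveat: a fixed-width integer accumulator can wrap around
    # silently, while mean() accumulates in float64 and stays correct here.
    big = np.full(1000, 2**62, dtype=np.int64)
    print(big.sum())   # wrapped-around (wrong) integer result
    print(big.mean())  # 4.611686018427388e+18, i.e. 2**62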

Additionally, np.mean has a higher (constant) overhead due to the way it is implemented internally (it performs a generic reduction, and there are more checks done and functions called). This overhead is inherent to nearly all Numpy functions, but it can be significantly bigger for some of them. This part can be optimized, and we previously worked a bit on reducing it, but a lot of code needs to be modified to make this faster and it is not a critical point so far (Numpy is not designed to operate on very small arrays and can hardly be, because CPython causes a lot of checks to be done). If you set the array size to 200_000, then the execution times of b.mean() and b.sum(dtype=np.float64)/b.size should be very close (this is the case on my machine).
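
A quick way to reproduce that large-array comparison (a sketch; 200_000 matches the size mentioned above, and number=10_000 is an arbitrary choice to keep the run short):

    import timeit

    setup = 'import numpy as np; b = np.random.randint(100, size=200_000)'

    # With a large array the constant per-call overhead of np.mean is
    # amortised, so these two timings should be very close to each other.
    print(timeit.timeit(setup=setup, stmt='b.mean()', number=10_000))
    print(timeit.timeit(setup=setup, stmt='b.sum(dtype=np.float64) / b.size', number=10_000))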

Jérôme Richard
  • I suspect you wanted to link a post under "this post" but forgot. I'd be interested in it myself! – Dominik Stańczak Jul 05 '22 at 10:41
  • @DominikStańczak Yeah, I just saw this when it was posted. I added it quickly but not enough ;). Thank you. – Jérôme Richard Jul 05 '22 at 10:49
  • @JérômeRichard Thank you for your answer. I need your advice on this: I have a boolean array and need to compute an accuracy score from it. Which approach should I use: `np.count_nonzero(self.predict(X) == y)/y.size` or `(self.predict(X) == y).mean()`? – Gold_Leaf Jul 05 '22 at 17:13
  • @Gold_Leaf Well, `np.count_nonzero` was supposed to be efficiently implemented, but it turns out it is pretty slow in my (not up-to-date) Numpy. Results show it does not use SIMD instructions (while there are SIMD functions for that in Numpy...). I need to check/fix the code. `np.mean` will be pretty inefficient in this case too because each 1-byte boolean will be converted to an 8-byte double. `np.sum` is pretty good, especially in the latest version of Numpy, because I optimized it a bit so that it uses SIMD instructions. It is still suboptimal because of the implicit type conversion. – Jérôme Richard Jul 05 '22 at 17:41
  • @Gold_Leaf If you know that the number of ones is very tiny and the array is quite big, then you can use `np.where(y)[0].size/y.size`. It seems crazy, but this is the most aggressively optimized method for that in Numpy yet (though it is pretty slow if there are a lot of ones). If the number of ones is small and bounded by 255, then you can cheat with `np.sum(v.view(np.uint8), dtype=np.uint8)`. For other cases there are tricks, but I think it is better for you to use `np.sum` with a relatively small `dtype`, or use `np.count_nonzero` and wait for an improved future version of Numpy. – Jérôme Richard Jul 05 '22 at 17:48
  • @Gold_Leaf `np.count_nonzero` appears to be the fastest version with a recent version of Numpy and the type `np.bool_`, and it looks like it properly uses SIMD instructions. Note that it is surprisingly inefficient for `uint8`/`int8`/`uint16`/`int16` on Windows yet, but the code looks fine... **Put it shortly**: please use `np.count_nonzero` with `np.bool_` and a recent Numpy :) (see the comparison sketch after these comments) – Jérôme Richard Jul 05 '22 at 17:59
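
Pulling the options from these comments together, here is a small benchmark sketch (the mask array is a made-up stand-in for `self.predict(X) == y`; which option wins depends on the Numpy version, as discussed above):

    import timeit

    # Made-up stand-in for `self.predict(X) == y`: a large boolean mask.
    setup = 'import numpy as np; mask = np.random.rand(1_000_000) < 0.5'

    # Accuracy as the fraction of True entries, computed three ways.
    stmts = {
        'count_nonzero': 'np.count_nonzero(mask) / mask.size',
        'mean':          'mask.mean()',
        'sum':           'mask.sum() / mask.size',
    }

    for name, stmt in stmts.items():
        t = timeit.timeit(setup=setup, stmt=stmt, number=1_000)
        print(f'{name:>13}: {t:.4f} s')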