
When optimising slow parts of my code, I was surprised by the fact that A.sum() is almost twice as fast as A.max():

In [1]: A = arange(10*20*30*40).reshape(10, 20, 30, 40)

In [2]: %timeit A.max()
1000 loops, best of 3: 216 us per loop

In [3]: %timeit A.sum()
10000 loops, best of 3: 119 us per loop

In [4]: %timeit A.any()
1000 loops, best of 3: 217 us per loop

I had expected that A.any() would be much faster (it should need to check only one element!), followed by A.max(), and that A.sum() would be the slowest (sum() needs to add numbers and update a value every time, max needs to compare numbers every time and update sometimes, and I thought adding should be slower than comparing). In fact, it's the opposite. Why?
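The IPython timings above can be reproduced in a plain script with the standard-library `timeit` module (absolute numbers will differ by machine and NumPy build; this is just a sketch of the same experiment):

```python
import timeit
import numpy as np

# Same array as in the question: 240,000 int elements.
A = np.arange(10 * 20 * 30 * 40).reshape(10, 20, 30, 40)

for name in ("max", "sum", "any"):
    # Time 1000 calls of each reduction and report microseconds per call.
    t = timeit.timeit(f"A.{name}()", globals={"A": A}, number=1000)
    print(f"A.{name}(): {t / 1000 * 1e6:.0f} us per call")
```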

gerrit
    There is a related post about `any()`: http://stackoverflow.com/questions/17128116/why-is-numpy-any-so-slow-over-large-arrays?rq=1 – EdChum Mar 13 '14 at 22:21
    I'm curious, why would you think `max` should be faster than `sum`? I could understand expecting both to be equally fast, or `sum` faster than `any`. –  Mar 13 '14 at 22:21
  • @delnan Because for `sum`, for every entry, it needs to (1) add two numbers (2) update a value, whereas for `max`, for every entry, it needs to (1) compare two numbers and (2) sometimes update a value. I thought adding should be slower than comparing... – gerrit Mar 13 '14 at 22:23
    Conditional expressions are often slower than arithmetic ones because of the way CPUs are built : http://stackoverflow.com/questions/9820319/why-is-a-cpu-branch-instruction-slow – F.X. Mar 13 '14 at 22:24
    @gerrit: Adding and comparing are typically the same speed, while _branching_ (required by max but not sum) is far, far slower. In particular, branch mispredictions might cost as much as 60 additions. – Charles Mar 13 '14 at 22:25
  • @gerrit Thanks for explaining. One reason that doesn't even pan out in theory (i.e. aside from branches being relatively slow in practice), is that "compare, if greater then ..." is *two* instructions, a comparisons and a conditional jump. –  Mar 13 '14 at 22:28
    Related: [Why is processing a sorted array faster than an unsorted array?](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array). – Bakuriu Mar 13 '14 at 22:28
  • Interesting that running the same timings on float data gives different results. – Blaz Bratanic Mar 13 '14 at 23:29
  • Oddly enough, I'm seeing the opposite of this on my machine. Max is marginally faster than sum. This is the same regardless of how the array is sorted. I'd bet that the speed difference here depends primarily on how NumPy is compiled and on the hardware you are using. – IanH Mar 14 '14 at 05:18
  • Hmm I just noticed that my test case was somewhat different from yours. I tried this on `np.arange(1000000, dtype=np.float64)`, `np.arange(99999, -1, -1, dtype=np.float64)`, `np.random.rand(1000000)`. When using the same test case, I get similar results, regardless of how the array is sorted. – IanH Mar 14 '14 at 05:32

1 Answer


max has to keep a running value and check every element for a potential update, which requires the CPU to execute branch instructions. sum just churns through the values with an unconditional add.

So sum will be quicker.
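In pure-Python terms, the inner loops of the two reductions look roughly like this (a sketch of the idea, not NumPy's actual C implementation):

```python
def loop_sum(values):
    # Unconditional: one add and one store per element, no data-dependent branch.
    acc = 0
    for v in values:
        acc = acc + v
    return acc

def loop_max(values):
    it = iter(values)
    acc = next(it)
    for v in it:
        # Each element needs a comparison plus a data-dependent branch;
        # when the CPU mispredicts the branch, the pipeline stalls.
        if v > acc:
            acc = v
    return acc
```

The point is the `if` in `loop_max`: whether the branch is taken depends on the data, which is exactly what branch predictors can get wrong, while `loop_sum` has no such branch at all.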

P45 Imminent
    No. It's quicker because it doesn't have to branch. It's well known that modern processors perform much better when there are no branch-prediction failures. `sum()` completely avoids branches and thus runs at full speed. – Bakuriu Mar 13 '14 at 22:27
  • @Bakuriu try with a sorted list (no branch prediction failures) to check that claim – wim Mar 13 '14 at 22:29
    @wim Unless reshaping affects iteration order (which I doubt), the array in the question *is* sorted (in ascending order, specifically) as it comes out of `arange`. –  Mar 13 '14 at 22:37
  • Hmm, didn't notice that. But how can we know `A.any` is implemented to traverse the array in the same direction as `A.flat`? – wim Mar 13 '14 at 23:35
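The ordering experiments discussed in the comments (IanH's three test arrays: ascending, descending, random floats) can be rerun with something like this. Absolute numbers are machine- and build-dependent, so no expected output is shown:

```python
import timeit
import numpy as np

# The three arrays from IanH's comment: sorted ascending, sorted descending, random.
cases = {
    "ascending":  np.arange(1000000, dtype=np.float64),
    "descending": np.arange(99999, -1, -1, dtype=np.float64),
    "random":     np.random.rand(1000000),
}

for name, arr in cases.items():
    # timeit accepts a callable directly; time 100 calls of each reduction.
    t_max = timeit.timeit(arr.max, number=100)
    t_sum = timeit.timeit(arr.sum, number=100)
    print(f"{name:>10} ({arr.size} elements): max {t_max:.4f}s, sum {t_sum:.4f}s per 100 calls")
```

If `max` and `sum` time out the same regardless of ordering, that supports the view that the gap comes from how NumPy is compiled rather than from branch-prediction failures alone.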