4

So I'm making Conway's Game of Life in Python 3, and I have a function called `updateboard` that gives birth to and kills cells based on their neighbor count (from 0 to 8), stored in `self.neighbors`. The function looks like this:

def updateboard(self):
    """
    Updates the board state
    """
    alive = np.zeros((self.cellsize, self.cellsize), dtype=bool)    # board state for next loop

    # iterate cells
    for y in range(self.cellsize):
        for x in range(self.cellsize):
            if self.neighbors[x, y] > 0:    # only look at cells with at least one neighbor
                if self.board[x, y]:        # cell is currently alive
                    alive[x, y] = True if self.neighbors[x, y] in (2, 3) else False
                else:                       # cell is currently dead
                    alive[x, y] = True if self.neighbors[x, y] == 3 else False

    # update states
    self.updateneighbors(self.board, alive)
    self.board = alive

To avoid redundant checks, I am checking whether self.neighbors at that cell is greater than 0 before deciding whether the cell lives or dies.

I was trying out different things to optimize this function, and I found that changing `if self.neighbors[x, y] > 0` to `if self.neighbors[x, y]` significantly sped up the function.
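That is, the only line that changed is this one:

# before
if self.neighbors[x, y] > 0:

# after
if self.neighbors[x, y]: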

Running the Python profiler shows that this one change made the function almost 6 times faster.

Before

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   459   12.465    0.027   13.916    0.030 logic.py:55(updateboard)

After

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   460    1.619    0.004    3.067    0.007 logic.py:55(updateboard)

I tried looking for an explanation online and found many similar questions, but I haven't managed to find an answer to this specific question yet. I am both confused and surprised by how such a small change made such a difference, and would greatly appreciate it if someone could help explain this to me.

Orbital
  • 565
  • 3
  • 13
  • If you are asking questions about implementation details you should always name the interpreter and version. – Klaus D. May 29 '21 at 05:06
  • Thanks. I updated my post to specify what version of Python I am using – Orbital May 29 '21 at 05:11
  • Your question boils down to "why is `if x:` faster than `if x>0:`." Always ask the smallest possible question - smaller questions are easier to understand and answer. – DYZ May 29 '21 at 05:34

3 Answers

4

You're working with NumPy arrays by iterating over their elements one at a time instead of pushing the work into optimized NumPy C-level code. This is not taking advantage of the power of NumPy at all. It's like dragging your car behind you by hand as you walk to the store, instead of driving it.

NumPy is not designed for single-element operations. It supports them, but under the assumption that single-element operations will be rare. NumPy has massive overhead for most operations on single elements, such as >. It needs to do a ton of dispatching, type conversion, and wrapper object allocation that doesn't happen when working with ordinary Python lists and scalars.

A simple experiment shows that this overhead is completely unlike what happens with ordinary Python ints:

In [1]: x = 1

In [2]: %timeit if x: pass
26.1 ns ± 0.0316 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: %timeit if x > 1: pass
35.6 ns ± 0.315 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: import numpy

In [5]: x = numpy.int64(1)

In [6]: %timeit if x: pass
26.8 ns ± 0.0422 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [7]: %timeit if x > 1: pass
203 ns ± 7.63 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The extra `> 1` introduces much less overhead for the `int` than for the `numpy.int64`.

Note that the implicit boolean interpretation involved in if x is a much simpler operation for NumPy than x > 1. There is no second argument to dispatch on or convert, and the C-level nb_bool hook doesn't need to create a wrapper object. NumPy can perform this operation almost as fast as an ordinary Python scalar can.
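A quick way to see part of the difference is to look at what each operation actually produces:

import numpy

x = numpy.int64(1)

# The comparison goes through NumPy's full dispatch machinery and allocates
# a new NumPy scalar as its result.
r = x > 1
print(type(r))        # <class 'numpy.bool_'>

# The truth test only calls the scalar's C-level nb_bool slot and yields a
# plain Python bool, with no new NumPy object involved.
print(type(bool(x)))  # <class 'bool'>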


To really speed up your code, you should work in terms of arrays rather than elements:

alive = (self.neighbors == 3) | ((self.neighbors == 2) & self.board)

NumPy will then be able to loop over the arrays' buffers directly at C level, resulting in much lower overhead.
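A sketch of how the whole method could then look, keeping the updateneighbors call and attributes from the question:

def updateboard(self):
    """
    Updates the board state (vectorized)
    """
    # Apply both Life rules to every cell at once: a cell is alive in the
    # next generation if it has exactly 3 neighbors, or exactly 2 neighbors
    # and is currently alive.
    alive = (self.neighbors == 3) | ((self.neighbors == 2) & self.board)

    # update states, as in the original
    self.updateneighbors(self.board, alive)
    self.board = alive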

user2357112
  • 260,549
  • 28
  • 431
  • 505
0

Note that this behaviour might change depending on the Python interpreter or CPU architecture you are using.

In general, x86 CPUs have a special bit called the Zero Flag, which is set when the result of an arithmetic operation is zero. This is used for equality checks, e.g.:

if x == 3

In assembly, it would be:

mov ecx, 1
mov edx, 3
cmp ecx, edx
je equal
; if it did not jump to the label equal,
; then this means ecx and edx are not equal.
equal:
; if it jumped here, then this means ecx and edx are equal

Similarly, an inequality check:

if x < 3:

would produce very similar assembly, as there are specific x86 instructions for inequality comparisons, which use the Sign Flag under the hood.


But as I noted, what assembly gets executed depends on the interpreter and on the processor architecture. Even the processor's internal optimizations can have a huge effect, e.g. branch prediction; see: Why is processing a sorted array faster than processing an unsorted array?

So the general answer is: unless you check exactly which assembly gets executed, you'll never know what really happens there, and even if you had the assembly code, it would probably not be what actually gets executed. All in all, you should not bother with this kind of performance issue, nor should you change your code because of it. If you have such performance issues, try upgrading your interpreter. If low-level performance is a huge factor, maybe you should not use Python, but switch to a language that gives you more direct control over low-level behaviour (e.g. C/C++).

If you are interested in how differently a CPU might execute the code you think you have written, there is a very good presentation on this topic by Herb Sutter. You might be especially interested in the first part:

https://www.youtube.com/watch?v=A8eCGOqgvH4

https://www.youtube.com/watch?v=KeLBd2EJLOU

Peter Lang
  • 357
  • 1
  • 5
0

The explicit comparison requires two extra operations: loading the constant 0 (LOAD_CONST) and comparing it to your expression (COMPARE_OP).

dis.dis('if x>0: pass')
  1           0 LOAD_NAME                0 (x)
              2 LOAD_CONST               0 (0)
              4 COMPARE_OP               4 (>)
              6 POP_JUMP_IF_FALSE        8
        >>    8 LOAD_CONST               1 (None)
             10 RETURN_VALUE

dis.dis('if x: pass')
  1           0 LOAD_NAME                0 (x)
              2 POP_JUMP_IF_FALSE        4
        >>    4 LOAD_CONST               0 (None)
              6 RETURN_VALUE

The operation POP_JUMP_IF_FALSE pops the value on top of the stack and takes the jump if that value is falsy.
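If you want a rough sense of what those two extra instructions cost on a plain Python int, something like this works (numbers will vary by machine and CPython version):

import timeit

x = 1
# The first statement runs the extra LOAD_CONST and COMPARE_OP on every iteration.
print(timeit.timeit('if x > 0: pass', globals={'x': x}))
print(timeit.timeit('if x: pass', globals={'x': x}))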

DYZ
  • 55,249
  • 10
  • 64
  • 93