Fastest Cython implementation depends on computer?

Question

I am converting a Python script to Cython and optimizing it for speed. Right now I have two versions: on my desktop V2 is twice as fast as V1, but on my laptop V1 is twice as fast as V2, and I am unable to find out why there is such a big difference. Both computers use:
- Ubuntu 16.04
- Python 2.7.12
- Cython 0.25.2
- Numpy 1.12.1
Desktop:
- Intel® Core™ i3-4370 CPU @ 3.80GHz × 4 64bit. 16GB RAM
Laptop:
- Intel® Core™ i5-3210 CPU @ 2.5GHz × 2 64bit. 8GB RAM

V1 - you can find the full code here. The only changes made are renaming go.py and preprocessing.py to go.pyx and preprocessing.pyx and using
import pyximport; pyximport.install() to compile them. You can run test.py. This version uses a 2D NumPy array, board, to store data in go.pyx and a list comprehension in the get_board function in preprocessing.pyx to process the data. During the test no function from go.py is called; only the NumPy array board is used.

V2 - you can find the full code here. Quite a lot has changed; below is a list of everything affecting this test case. Be aware that all function and variable declarations have to be in go.pxd. You can run test.py using this command: python test.py build_ext --inplace
The 2D NumPy array is replaced by:

cdef char board[ 362 ]

and the function get_board_feature in go.pyx replaces the NumPy list comprehension:

cdef char get_board_feature( self, short location ):
    # return correct board feature value
    # 0 active player stone
    # 1 opponent stone
    # 2 empty location

    cdef char value = self.board[ location ]

    if value == EMPTY:
        return 2

    if value == self.player_current:
        return 0

    return 1

The get_board function in preprocessing.pyx is replaced with a function that loops over the array and calls get_board_feature in go.pyx for every location:

@cython.boundscheck(False)
@cython.wraparound(False)
cdef int get_board(self, GameState state, np.ndarray[double, ndim=2] tensor, int offSet ):
    """A feature encoding WHITE BLACK and EMPTY on separate planes, but plane 0
       always refers to the current player and plane 1 to the opponent
    """

    cdef short location

    for location in range( 0, state.size * state.size ):

        tensor[ offSet + state.get_board_feature( location ), location ] = 1

    return offSet + 3
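For checking that both versions produce the same planes before comparing their speed, the encoding above can be sketched in plain Python. This is only a reference sketch: the constants EMPTY, BLACK and WHITE and the function names are assumptions for illustration (the real values live in go.pyx / go.pxd), and the tensor is a list of lists rather than a NumPy array.

```python
# Hypothetical constants; the real values are defined in go.pyx / go.pxd.
EMPTY, BLACK, WHITE = 0, 1, 2


def board_feature(board, player_current, location):
    """Return 0 for the current player's stone, 1 for the opponent's, 2 for empty."""
    value = board[location]
    if value == EMPTY:
        return 2
    if value == player_current:
        return 0
    return 1


def encode_board(board, player_current, size, tensor, offset):
    """Mark one of three planes per location and return the next free plane index."""
    for location in range(size * size):
        plane = offset + board_feature(board, player_current, location)
        tensor[plane][location] = 1
    return offset + 3


# Tiny 3x3 example: one black stone, one white stone, black to play.
size = 3
board = [EMPTY] * (size * size)
board[0] = BLACK
board[4] = WHITE
tensor = [[0] * (size * size) for _ in range(3)]
next_offset = encode_board(board, BLACK, size, tensor, 0)
```

Every location ends up set in exactly one plane, so the three planes together should sum to size * size.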

Please let me know if I should include any other information or run certain tests.

cmp, diff test
The V2 go.c and preprocessing.c files are identical on both machines. V1 does not generate a .c file to compare.

Update: compared .so files
The V2 go.so files are different:

goD.so goL.so differ: byte 473, line 1

The preprocessing.so files are identical; not sure what to think of that.
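The same kind of byte-level comparison that cmp reports can be sketched in Python, which may help when scripting the check across several build outputs. The function name first_difference and the temporary file names are hypothetical, and the sample bytes are made up; this is a minimal sketch, not a replacement for cmp.

```python
import os
import tempfile


def first_difference(path_a, path_b, chunk_size=65536):
    """Return the 1-based offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the offset just past the shorter
    file is returned.
    """
    offset = 0
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                common = min(len(chunk_a), len(chunk_b))
                for i in range(common):
                    if chunk_a[i:i + 1] != chunk_b[i:i + 1]:
                        return offset + i + 1
                return offset + common + 1
            if not chunk_a:
                return None
            offset += len(chunk_a)


# Demo with two small made-up files that differ at the ninth byte.
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "a.bin")
path_b = os.path.join(tmp_dir, "b.bin")
with open(path_a, "wb") as handle:
    handle.write(b"\x7fELF" + b"\x00" * 10)
with open(path_b, "wb") as handle:
    handle.write(b"\x7fELF" + b"\x00" * 4 + b"\x01" + b"\x00" * 5)
diff_at = first_difference(path_a, path_b)
```

A single early differing byte does not by itself say whether the generated machine code differs; it may sit in header or metadata bytes, which is an assumption that would need checking with a disassembler or similar tool.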

My feeling is that this is too much code for me to dig into (and I suspect other people will feel the same). General advice is: I'd start by trying to profile it (on both systems) and see if you can identify which functions show the big differences in the profiled data. If you can get that down then you could probably construct a smaller test-case just using the function that shows the main difference. — DavidW, Mar 20 '17 at 19:07
The second thing I'd include in the question is details of the CPUs in the two systems. Is one 32 bit and the other 64 bit for example - I could imagine that could make a big difference in terms of preferred data types? — DavidW, Mar 20 '17 at 19:09
Well, only the things I mention are being used, but I see your point and will create a clean version for readability. Good point about the CPUs; I will add that information. — MaMiFreak, Mar 20 '17 at 19:56
I'd also double check if the compiled modules produced are identical (`diff` or `cmp`). I imagine they should be, but if not then that could be something to look at. — DavidW, Mar 20 '17 at 20:16
I removed all unused code, so the code should be readable. I only have to add an explanation of `go.py` from V1, as only the NumPy array is used. — MaMiFreak, Mar 20 '17 at 20:26

score 0 · Accepted Answer · answered Mar 20 '17 at 21:07

They are two different machines and behave differently. There's a reason why processor reviews use large benchmark suites. It could be said that the desktop CPU performs better on average, but execution times for two small but non-trivial pieces of code do not have to favor the desktop CPU. And differences in execution times definitely do not have to follow any linear relationship. Performance always depends on a huge number of factors. Possible explanations include, but are not limited to, the smaller L1 and L2 caches on the desktop and the change in vector instruction sets from AVX to AVX2 between the Ivy Bridge laptop and the Haswell desktop.

Generally it's a good idea to concentrate on using good algorithms and to identify and remove bottlenecks when optimizing performance. Staring at benchmarks between different machines will probably only cause a headache.
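One way to look for a bottleneck on each machine is to time the candidate functions in isolation with the standard timeit module and keep the best of several repeats, which reduces noise from other processes. The sketch below is an illustration only: best_time, encode_with_comprehension and encode_with_loop are hypothetical stand-ins written in plain Python, not the V1 and V2 code from the question.

```python
import timeit


def best_time(func, repeat=5, number=200):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in data: a 19x19 board flattened to 361 cells, all empty (0).
BOARD = [0] * 361


def encode_with_comprehension():
    """Stand-in for a list-comprehension style encoding."""
    return [2 if value == 0 else 1 for value in BOARD]


def encode_with_loop():
    """Stand-in for an explicit loop over every location."""
    out = [0] * len(BOARD)
    for i in range(len(BOARD)):
        out[i] = 2 if BOARD[i] == 0 else 1
    return out


results = {
    func.__name__: best_time(func)
    for func in (encode_with_comprehension, encode_with_loop)
}
for name, seconds in sorted(results.items()):
    print("%-28s %.2f us per call" % (name, seconds * 1e6))
```

Running the same script on both machines gives per-function numbers that can be compared directly, instead of one total run time for the whole test.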

Thanks! This crossed my mind, but the code in question is a simple array lookup, a comparison, and setting a value in a different array, and it seems rather extreme to get a 2X speed difference in favor of V1 or V2 depending on what computer one uses. — MaMiFreak, Mar 21 '17 at 15:38
