7

Solving a recent Advent of Code problem, I found my default Python was ~40x slower than PyPy. I was able to get that down to about 17x with this code by limiting calls to len and limiting global lookups by running it in a function.

Right now, e.py runs in 5.162 seconds on python 3.6.3 and .297 seconds on PyPy on my machine.

My question is: is this the irreducible speedup of JIT, or is there some way to speed up the CPython answer? (short of extreme means: I could go to Cython/Numba or something?) How do I convince myself that there's nothing more I can do?


See the gist for the list of numbers input file.

As described in the problem statement, they represent jump offsets. position += offsets[current], and increment the current offset by 1. You're done when a jump takes you outside the list.

Here's the example given (the full input that takes 5 seconds is much longer, and has larger numbers):

(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.

The code:

def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            if location < 0:
                print(counter)
                break
            counter += 1
        except:
            print(counter)
            break

if __name__=="__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)

edit: I compiled and ran the code with Cython, that dropped runtime to 2.53s, but that's still almost an order of magnitude slower than PyPy.

edit: Numba gets me to within 2x

edit: The best Cython I could write got down to 1.32s, a little over 4x pypy

edit: Moving the cmd variable into a cdef, as suggested by @viraptor, got the Cython down to .157 seconds! Nearly a full order of magnitude faster, and not that far from regular python. Still, I come away from this extremely impressed with the PyPy JIT, which did all this for free!

trinaldi
  • 2,872
  • 2
  • 32
  • 37
llimllib
  • 3,642
  • 1
  • 29
  • 29
  • 1
    From the problem statement: *An urgent interrupt arrives from the CPU: it's trapped in a maze of jump instructions,* Obviously the glib answer is: Don't write IRQ handlers in Python :P – Peter Cordes Dec 06 '17 at 04:11
  • The reason I ask is not because this is a good idea of course, but it's the greatest pypy -> python disparity that I've seen in practice, so I'm curious about it – llimllib Dec 06 '17 at 04:13
  • 1
    I'm not very familiar with Python (or the performance of different implementations) but yeah I'd expect that JIT will help a *lot* for a small CPU-bound loop like this. Looks like an interesting question to me; upvoted. Probably you should give a quick summary of what the problem is, instead of just an external link. – Peter Cordes Dec 06 '17 at 04:15
  • 2
    Can you put the `try` around the loop, so an interpreter doesn't have to enter a new `try` block every time through the loop? – Peter Cordes Dec 06 '17 at 04:15
  • 2
    At the very least, go through this and make sure you check off [the well known](https://wiki.python.org/moin/PythonSpeed) But this belongs on CodeReview – juanpa.arrivillaga Dec 06 '17 at 04:18
  • 1
    @PeterCordes moving the try out of the loop gives a minor bump, runtime down to 4.8s – llimllib Dec 06 '17 at 04:19
  • In particular, look [here](https://wiki.python.org/moin/PythonSpeed/PerformanceTips) – juanpa.arrivillaga Dec 06 '17 at 04:23
  • And if you are going to use `numba` you might as well use `numpy`. That combinations is pretty fast. – juanpa.arrivillaga Dec 06 '17 at 04:25
  • @juanpa.arrivillaga mostly I used numba to answer the question for myself: is the jit the reason for the more-than-order-of-magnitude difference? And the answer definitely seems to be yes. But I'd like to understand this in more detail! – llimllib Dec 06 '17 at 04:26
  • Where does the `if cmd >= 3` condition come from? I don't see it mentioned in the problem statement, just always increment. – Peter Cordes Dec 06 '17 at 04:39
  • @PeterCordes that's a stipulation of part 2, which you have to complete part 1 to unlock – llimllib Dec 06 '17 at 04:39
  • Is it much faster without the `if`? Less interpreter overhead, but the JIT version is probably faster too because of fewer branch mispredicts (unless it emits branchless asm). – Peter Cordes Dec 06 '17 at 04:42
  • What if you pull `if location < 0:` out of the loop as well, and let bounds-checking handle that? One unsigned compare can catch negative and large positive `positions`, so either do it yourself or let the interpreter do it, but don't do both. – Peter Cordes Dec 06 '17 at 04:43
  • @PeterCordes if I pulled location out, then the code wouldn't fail on negative indexes because python supports negative indexes, so I'd get the wrong answer – llimllib Dec 06 '17 at 04:45
  • Ah right, it wraps around to the end of the container. Oops! Is there any way you can turn the check into an unsigned compare against `cmds.size` or whatever it's called, so negative gets treated as a large positive? (And leave out the `try`). Maybe that would end up having more interpreter overhead and not even be worth it, though. But it's what you'd do in asm :P – Peter Cordes Dec 06 '17 at 04:46
  • @juanpa.arrivillaga as far as I can tell, no suggestion on that page is relevant to this code. I tried replacing `[` with `__getattribute__` and `__setattribute__` before I had posted this question, and it does not speed anything up AFAICT – llimllib Dec 06 '17 at 04:51
  • @llimllib did you "inline" those method calls? – juanpa.arrivillaga Dec 06 '17 at 05:07
  • Not sure what you mean @juanpa.arrivillaga – llimllib Dec 06 '17 at 05:20
  • @llimllib [avoiding dots](https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Avoiding_dots...) – juanpa.arrivillaga Dec 06 '17 at 05:26
  • oh yeah, that is how I implemented that @juanpa.arrivillaga – llimllib Dec 06 '17 at 05:28

2 Answers2

9

As a baseline for Python, I wrote this in C (and then decided to use C++ for more convenient array I/O). It compiles efficiently for x86-64 with clang++. This runs 82x faster than CPython3.6.2 with the code in the question, on a Skylake x86, so even your faster Python versions are still a couple factors away from keeping up with near-optimal machine code. (Yes, I looked at the compiler's asm output to check that it did a good job).

Letting a good JIT or ahead-of-time compiler see the loop logic is key for performance here. The problem logic is inherently serial, so there's no scope for getting Python to run already-compiled C to do something over the whole array (like NumPy), because there won't be compiled C for this specific problem unless you use Cython or something. Having each step of the problem go back to the CPython interpreter is death for performance, when its slowness isn't hidden by memory bottlenecks or anything.


Update: transforming the array of offsets into an array of pointers speeds it up by another factor of 1.5x (simple addressing mode + removing an add from the critical path loop-carried dependency chain, bringing it down to just the 4 cycle L1D load-use latency for a simple addressing mode (when the pointer comes from another load), not 6c = 5c + 1c for an indexed addressing mode + add latency).

But I think we can be generous to Python and not expect it to keep up with algorithm transformations to suit modern CPUs :P (Especially because I used 32-bit pointers even in 64-bit mode to make sure the 4585 element array was still only 18kiB so it fits in the 32kiB L1D cache. Like the Linux x32 ABI does, or the AArch64 ILP32 ABI.)


Also, an alternate expression for the update gets gcc to compile it branchless, like clang does. (Commented out and original perf stat output left in this answer, because it's interesting the see the effect of branchless vs. branchy with mispredicts.)

unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;

    do {
          //location += offset[location]++;            // simple version
          // >=3 conditional version below

        int off = offset[location];

        offset[location] += (off>=3) ? -1 : 1;       // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.  

        location += off;

        counter++;
    } while (location < size);

    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ios::sync_with_stdio(false);     // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}

With clang4.0.1 -O3 -march=skylake, the inner loop is branchless; it uses a conditional-move for the >=3 condition. I used ? : because that's what I was hoping the compiler would do. Source + asm on the Godbolt compiler explorer

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
    mov     ebx, edi               ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d               ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi               ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi               ; location += off  (original value)
    add     edx, 1                 ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                ; unsigned compare against array size

Here's the output of perf stat ./a.out < input.txt (for the clang version), on my i7-6700k Skylake CPU:

21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               119      page-faults               #    0.003 M/sec                  
       143,680,934      cycles                    #    3.900 GHz                    
       245,059,492      instructions              #    1.71  insn per cycle         
        22,654,670      branches                  #  614.890 M/sec                  
            20,171      branch-misses             #    0.09% of all branches        

       0.036953258 seconds time elapsed

The average instructions-per-clock is much lower than 4 because of the data dependency in the loop. The load address for the next iteration depends on the load+add for this iteration, and out-of-order execution can't hide that. It can overlap all the work of updating the value of the current location, though.

Changing from int to short has no performance downside (as expected; movsx has the same latency as mov on Skylake), but halves the memory consumption so a larger array could fit in L1D cache if needed.

I tried compiling the array into the program (as int offsets[] = { file contents with commas added }; so it doesn't have to be read and parsed. It also made the size a compile-time constant. This reduced the run time to ~36.2 +/- 0.1 milliseconds, down from ~36.8, so the version that reads from a file was still spending most of its time on the actual problem, not parsing the input. (Unlike Python, C++ has negligible startup overhead, and my Skylake CPU ramps up to max clock speed in well under a millisecond thanks to hardware P-state management in Skylake.)


As described earlier, pointer-chasing with a simple addressing mode like [rdi] instead of [rdi + rdx*4] has 1c lower latency, and avoids the add (index += offset becomes current = target). Intel since IvyBridge has zero-latency integer mov between registers so that doesn't lengthen the critical path. Here's the source (with comments) + asm for this hacky optimization. A typical run (with text parsing into a std::vector): 23.26 +- 0.05 ms, 90.725 M cycles (3.900 GHz), 288.724 M instructions (3.18 IPC). Interestingly it's actually more total instructions, but runs much faster due to the lower latency of the loop-carried dependency chain.


gcc uses a branch and it's about 2x slower. (14% of branches are mispredicted according to perf stat on the whole program. It's only branching as part of updating the values, but branch misses stall the pipeline so they affect the critical path too, in a way that data dependencies don't here. This seems like a poor decision by the optimizer.)

Rewriting the conditional as offset[location] = (off>=3) ? off-1 : off+1; convinces gcc to go branchless with asm that looks good.

gcc7.1.1 -O3 -march=skylake (for the original source that compiles with a branch for (off <= 3) ? : -1 : +1).

Performance counter stats for './ec-gcc':

     70.032162      task-clock (msec)         #    0.998 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           118      page-faults               #    0.002 M/sec                  
   273,115,485      cycles                    #    3.900 GHz                    
   255,088,412      instructions              #    0.93  insn per cycle         
    44,382,466      branches                  #  633.744 M/sec                  
     6,230,137      branch-misses             #   14.04% of all branches        

   0.070181924 seconds time elapsed

vs. CPython (Python3.6.2 on Arch Linux):

perf stat python ./orig-v2.e.py
21841249

 Performance counter stats for 'python ./orig-v2.e.py':

       3046.703831      task-clock (msec)         #    1.000 CPUs utilized          
                10      context-switches          #    0.003 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               923      page-faults               #    0.303 K/sec                  
    11,880,130,860      cycles                    #    3.899 GHz                    
    38,731,286,195      instructions              #    3.26  insn per cycle         
     8,489,399,768      branches                  # 2786.421 M/sec                  
        18,666,459      branch-misses             #    0.22% of all branches        

       3.046819579 seconds time elapsed

I don't have PyPy or other Python implementations installed, sorry.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
2

You can improve small things, but pypy will (most likely) always be faster in this task.

For both CPython and Cython:

  • Don't read in the whole file at once. You can iterate on lines as you go, which saves you (re)allocation costs. This requires you to pre-allocate the array though.
  • Maybe switch from a list to an array.

For Cython:

  • Mark the variable types. Especially the counters as ints and the commands as an array of ints to skip most type checks.
viraptor
  • 33,322
  • 10
  • 107
  • 191
  • reading the file is irrelevant to the runtime, it all happens as a constant cost at the beginning. You're right though – llimllib Dec 06 '17 at 04:24
  • make that 1.3s. Close! Less than 5x speed penalty, but farther and farther from python :) – llimllib Dec 06 '17 at 04:43
  • 2
    @llimllib You can do better: rename `cmd` in the list comprehension, so it doesn't collide with the one in `while`. Then add a `cdef int cmd = 0` before the loop. Otherwise, each `cmd` is constructed as an object. – viraptor Dec 06 '17 at 09:36
  • 2
    @llimllib cython actually beats pypy with all the types supplied: pyx from the gist: `1 loop, best of 3: 1.23 s per loop`, after the modifications from the last comment: `10 loops, best of 3: 68.3 ms per loop` (both compiled with `-O3` for python3.6) – viraptor Dec 06 '17 at 09:45
  • ah good point, can't believe I missed that var! cython down to .157s, very impressive! – llimllib Dec 06 '17 at 18:48