14
def sieve(n):
    nums = [0] * n
    for i in range(2, int(n**0.5)+1):
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [i for i in range(2, n) if nums[i] == 0]

def sieve_var(n):
    nums = [0] * n
    for i in range(3, int(n**0.5)+1, 2):
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [2] + [i for i in range(3, n, 2) if nums[i] == 0]

On my machine, sieve(10**8) takes 2.28 s while sieve_var(10**8) takes 2.67 s. I don't think pypy's warmup time is the culprit here, so why isn't sieve_var, which iterates over fewer numbers, faster? In standard python 3.3 sieve_var is faster, as expected. Using pypy 4.0.1 32-bit on Windows 8.1.

Edit: As a test, I added count = 0 at the start of the function and count += 1 inside the inner loop (where nums[j] = 1 is). sieve(10**8) counts 242570202 while sieve_var(10**8) counts 192570204. So although the count isn't halved for sieve_var, it is doing less "work".
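
For reference, the instrumented version looks roughly like this (only the count bookkeeping is added to the original sieve; the function name is just for illustration):

def sieve_counted(n):
    count = 0
    nums = [0] * n
    for i in range(2, int(n**0.5)+1):
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1
                count += 1  # one increment per inner-loop write
    print(count)  # reported as 242570202 for n = 10**8
    return [i for i in range(2, n) if nums[i] == 0]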

For fun, here is a version with slice indexing:

def sieve_slice(n):
    sieve = [True] * n
    for i in range(3, int(n**0.5) + 1, 2):
        if sieve[i]:
            sieve[i*i::2*i] = [False] * ((n - i*i - 1) // (2*i) + 1)
    return [2] + [i for i in range(3, n, 2) if sieve[i]]

With python 3.6, sieve_slice runs about 4x faster than sieve, but with pypy3 7.3.0, sieve runs about 2x faster than sieve_slice.

qwr
  • @user2357112 if it's wrong then why does `sieve(10**8) == sieve_var(10**8)`? – qwr Jun 28 '17 at 20:05
  • How did you time this? – user2357112 Jun 28 '17 at 20:29
  • @user2357112 basically `t = time.clock(); print(len(sieve(10**8))); print(time.clock()-t)` – qwr Jun 28 '17 at 21:34
  • Did you time the two calls in the same process? – user2357112 Jun 28 '17 at 21:38
  • @user2357112 not sure what you mean but I ran both as separate programs – qwr Jun 28 '17 at 21:57
  • I can reproduce the problem. I get around 6.6 s for `len(sieve(2*10**8))` and 7 s for `len(sieve_var(2*10**8))`. I am using pypy 5.7.1 on 64-bit Windows. – Simd Jun 29 '17 at 09:23
  • more brainstorming ideas: when compiled by pypy, the `range` generator may use a single `INC` instruction for an increment of 1, but not for 2... or processors may be able to predict the next value of `i` in the simpler situation and correctly guess the branches... – Aprillion Jul 07 '17 at 07:52

2 Answers

10

I'm not sure why it's slightly slower on Windows. On Linux the speed is the same. However, I can answer why we get mostly the same speed. The answer would be the same if the program were written in C, and the answer is purely at the level of the processor. This program is bound by memory I/O on the accesses to the list, which is 400 or 800 MB in size. In the second version, you're basically avoiding one extra if nums[i] == 0 check. This extra check costs nothing, though, because the CPU just fetched nums[i - 1] into its caches during the previous iteration and will need nums[i + 1] during the next iteration. The CPU is waiting for the memory anyway.
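
As a rough sanity check on those sizes (assuming, as mentioned below, that the list is stored as one machine word per entry):

n = 10**8
# one machine word per list entry: 4 bytes on 32-bit PyPy, 8 bytes on 64-bit
print(n * 4 // 10**6, "MB on 32-bit")   # ~400 MB
print(n * 8 // 10**6, "MB on 64-bit")   # ~800 MB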

To verify what I'm saying, try to make the nums array more compact. I tried to access it with nums[i // 2], assuming that i is always odd, and the result was twice as fast. You can probably gain even more by not using a Python list (stored as an array of 32-bit integers on a 32-bit PyPy) but an array of bits instead (though it's much more code, because there is no standard built-in array of bits).
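
For illustration, a minimal sketch of such a compact layout (the odd-only indexing and the bytearray are my additions, not code from the original question):

def sieve_compact(n):
    # keep flags only for the odd numbers >= 3: index k stands for 2*k + 3;
    # a bytearray uses one byte per flag instead of one machine word per entry
    size = (n - 2) // 2              # number of odd values in [3, n)
    nums = bytearray(size)           # 0 = presumed prime, 1 = composite
    for i in range(3, int(n**0.5) + 1, 2):
        if nums[(i - 3) // 2] == 0:
            for j in range(i*i, n, 2*i):   # i*i is odd; step 2*i keeps j odd
                nums[(j - 3) // 2] = 1
    return [2] + [2*k + 3 for k in range(size) if nums[k] == 0]

Since the array is both half as long and one byte per entry instead of a word, far less memory has to be streamed through the caches.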

Armin Rigo
  • The number of iterations of the inner loop is much smaller in the sieve_var case. Try this code: https://bpaste.net/show/bfbd7e66bdcb – Simd Jun 29 '17 at 09:49
  • One extra check? Isn't the second version doing half as many checks? – qwr Jun 29 '17 at 09:59
  • Yes, the second version is in theory doing half as many checks. My point is that it's completely irrelevant, because these checks are all in the cache and the benchmark is memory-bound. I'm told it becomes 7 times faster (on 64-bit Linux) by using a bytearray instead, and, surprise, that's almost exactly the factor by which the memory shrinks. – Armin Rigo Jun 29 '17 at 10:14
  • I don't think this is the correct explanation – I changed the code to use `bytearray`; now on my home server (Linux) `sieve` and `sieve_var` take 55 and 39 s respectively using python3, but 5.29 and 5.53 s using pypy. So sieve_var is still slower under pypy. – Błotosmętek Jul 02 '17 at 21:32
  • Why is it the same speed in pypy but not in python? – qwr Jul 04 '17 at 07:56
  • @qwr if you run pypy without the jit compiler (`pypy --jit off`) you will see that `sieve_var` is faster than `sieve`. I guess it is overhead from the interpreter which makes the difference. – ead Jul 04 '17 at 11:06
  • @ead and if you compare `sieve_var` with and without JIT, which one is faster? if it's faster with JIT, I wouldn't call it "overhead", just that the less sophisticated version of the code was easier to optimize – Aprillion Jul 07 '17 at 07:27
4

TL;DR:

As a C program this would be a memory-bound algorithm. However, even jit-compiled pypy code has considerably more overhead, and the operations are no longer "for free". Surprisingly (or maybe not), the two sieve versions compile to different jit-code; it is probably just bad luck that the second version ends up slower.


If it were C, @Armin's answer would be the right one. It is well known that for modern computers/caches and memory-bound code it does not matter whether we skip over every other integer: all values have to be fetched from memory anyway, and that is the bottleneck. See this article for a great explanation.

Yet my experiments show that the non-optimized version (sieve) is slightly faster than the optimized version (sieve_var). The timings also show that the last line of sieve, i.e. [i for i in range(2, n) if nums[i] == 0], executes faster than the corresponding line of sieve_var, return [2] + [i for i in range(3, n, 2) if nums[i] == 0].

On my machine it was 0.45 seconds vs. 0.65 seconds for 10^8 elements. These numbers may vary from machine to machine, so it is quite possible that somebody with a faster CPU and slower memory would not see any difference at all. If it could be explained purely in terms of "memory dominates all", then the slower version should show more cache misses than the faster version.

However, running valgrind --tool=cachegrind pypy sieveXXX.py shows almost no difference in the number of cache misses, at least nothing that would explain the observed difference.

Let's consider a slightly simpler version, which exhibits exactly the same behavior - we don't save the primes, but just count them:

def sieve(n):
    ...
    res = 0
    for i in range(2, n):
        if nums[i] == 0:
            res += 1
    return res

def sieve_var(n):
    ...
    res = 1
    for i in range(3, n, 2):
        if nums[i] == 0:
            res += 1
    return res

The first version is still faster: 0.35 s vs. 0.45 s. (To make sure the time difference is no fluke and not due to jit warmup, I put the last part of the code into a for-loop and always got the same timings.)
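
A harness along these lines reproduces the repeated timing described above (the helper name and structure are illustrative, not the exact script used):

import time

def time_repeatedly(fn, n, repeats=5):
    # run the same call several times in one process so that
    # jit warmup only affects the first measurement
    for _ in range(repeats):
        t = time.time()
        res = fn(n)
        print(res, time.time() - t)

# e.g. time_repeatedly(sieve, 10**8); time_repeatedly(sieve_var, 10**8)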

Before going further, let's take a look at a C implementation and its assembly:

long long int sum(long long int *a, int n){
    long long int res=0;
    for(int i=2;i<n;i++)
       if(a[i]==0)
          res++;
    return res;
} 

Compiled with gcc -Os, we get:

        movl    $2, %edx
        xorl    %eax, %eax
.L4:
        cmpl    %edx, %esi
        jle     .L1
        cmpq    $0, (%rdi,%rdx,8)
        jne     .L3
        incq    %rax
.L3:
        incq    %rdx
        jmp     .L4
.L1:
        ret

Pretty small and straightforward, and it takes only 0.08 seconds on my machine. My memory can go as fast as 10 GB/s and there are 8*10^8 bytes to read (10^8 long long entries of 8 bytes each), so 8*10^8 bytes / 10 GB/s = 0.08 s: basically the whole time was needed just to fetch the data.

But from this we also see that the pypy version has about 0.25 seconds of overhead compared to the C code. Where does it come from? By using the vmprof module we can look at the jit-code and see that:

  1. for one iteration of the loop, there are many more operations than in the C version;
  2. the jit-code for sieve and sieve_var is very different. Using a debugger to count the number of instructions per iteration gives 24 for sieve and 76 for sieve_var; since sieve_var handles only every second element, the effective ratio is 24:38.

It is hard to tell why the jit-code is so different for the two versions without debugging pypy. Probably it is just bad luck that sieve_var ends up slower.

ead
  • Do you think the program is limited by list I/O speed? – qwr Jul 04 '17 at 07:58
  • Your answer is purely speculative. If you're describing PyPy then it is wrong, sorry. – Armin Rigo Jul 07 '17 at 13:23
  • @ArminRigo You are right, it is just an attempt to explain the experimental findings, because they cannot be explained in terms of memory bandwidth and caching alone. I'm very thankful to everybody who would correct my misunderstandings! – ead Jul 07 '17 at 13:34
  • In this kind of boring example, PyPy implements the list as an array of (machine-sized) integers, and unboxes all integers. That means that PyPy's JIT produces code close to what C would, and the difference is mostly missing low-level optimizations at which gcc is better. Maybe someone should first try to write actual C code, to check if we *do* get different behaviour; that sounds like something that should be done before assuming we'll get a difference and guessing why. – Armin Rigo Jul 07 '17 at 17:58
  • I don't know which answer can be marked as "correct" but I'm giving you the bounty for the experimental effort to determine pypy vs C differences. – qwr Jul 08 '17 at 18:02
  • @qwr I guess both answers didn't answer your question in full. – ead Jul 10 '17 at 04:42
  • I was kinda hoping someone who really knew the inner workings of pypy could give an authoritative answer, but it's more a curiosity to me than something actually serious, so I'm fine with the answers. – qwr Jul 10 '17 at 21:25
  • @ArminRigo Sorry to bother you again. I did some more digging and now it looks like the difference is due to more efficient jit-code for the first version. Does that sound right? What could be the reason for these differences? If compiled as C, the assembly of the two versions would not differ much... – ead Jul 13 '17 at 20:31