TL;DR:
As a C program this would be a memory-bound algorithm. However, even JIT-compiled PyPy code has considerably more overhead, and the operations are no longer "for free". Surprisingly (or maybe not), these two sieve versions get compiled to different JIT code; it is probably just bad luck that the second version results in slower code.
If this were C, @Armin's answer would be the right one. It is well known that for modern computers/caches and memory-bound code it does not matter whether we skip every other integer - all values must be fetched from memory anyway, and this is the bottleneck. See this article for a great explanation.
Yet my experiments show that the non-optimized version (sieve) is slightly faster than the optimized version (sieve_var). Timings also show that the last line of sieve, i.e.

[i for i in range(2, n) if nums[i] == 0]

is executed faster than the corresponding line of sieve_var:

return [2] + [i for i in range(3, n, 2) if nums[i] == 0]
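For context, the two versions look roughly like this (my reconstruction of the elided setup; the exact marking loops in the question may differ):

def sieve(n):
    nums = [0] * n
    for i in range(2, int(n ** 0.5) + 1):
        if nums[i] == 0:                  # i is prime
            for j in range(i * i, n, i):  # mark all its multiples
                nums[j] = 1
    return [i for i in range(2, n) if nums[i] == 0]

def sieve_var(n):
    nums = [0] * n
    for i in range(3, int(n ** 0.5) + 1, 2):
        if nums[i] == 0:                      # odd prime
            for j in range(i * i, n, 2 * i):  # mark only its odd multiples
                nums[j] = 1
    return [2] + [i for i in range(3, n, 2) if nums[i] == 0]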
On my machine it was 0.45 seconds vs. 0.65 seconds for 10^8 elements. These numbers may vary from machine to machine, so it is quite possible that somebody with a faster CPU and slower memory would not see any difference at all. If it could be explained in terms of "memory dominates all", then we should be able to see that the slower version has more cache misses than the faster version.
However, by running valgrind --tool=cachegrind pypy sieveXXX.py we can see that there is almost no difference in the number of cache misses - at least nothing that would explain the observed difference.
Let's consider a slightly simpler version, which exhibits exactly the same behavior - we don't save the primes, but just count them:
def sieve(n):
    ...                       # sieve marking as before
    res = 0
    for i in range(2, n):     # check every index
        if nums[i] == 0:
            res += 1
    return res

def sieve_var(n):
    ...                       # sieve marking as before
    res = 1                   # 2 is already counted
    for i in range(3, n, 2):  # check only odd indices
        if nums[i] == 0:
            res += 1
    return res
The first version is still faster: 0.35 sec vs. 0.45 sec (to make sure the time difference is not a fluke and not due to some JIT warm-up, I put the last part of the code into a for-loop and always got the same timings).
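The timing loop itself is nothing special; a minimal sketch of such a harness (the repeat count and the helper name are just placeholders) could be:

import time

def time_repeated(f, n, repeats=5):
    # run f(n) several times so JIT warm-up cannot distort a single measurement
    for _ in range(repeats):
        start = time.time()
        f(n)
        print(f.__name__, time.time() - start, "seconds")

time_repeated(sieve, 10**8)
time_repeated(sieve_var, 10**8)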
Before going further, let's take a look at a C implementation and its assembly:
long long int sum(long long int *a, int n){
    long long int res=0;
    for(int i=2;i<n;i++)
        if(a[i]==0)
            res++;
    return res;
}
Compiled with gcc and -Os we get:
        movl    $2, %edx
        xorl    %eax, %eax
.L4:
        cmpl    %edx, %esi
        jle     .L1
        cmpq    $0, (%rdi,%rdx,8)
        jne     .L3
        incq    %rax
.L3:
        incq    %rdx
        jmp     .L4
.L1:
        ret
Pretty small and straightforward, and it takes only 0.08 seconds on my machine. My memory can go as fast as 10 GB/s and there are 8 * 10^8 bytes, so essentially the whole time is needed to fetch the data.
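The back-of-the-envelope calculation behind that claim, using the numbers from above:

bytes_total = 8 * 10**8         # 10^8 elements of 8 bytes each
bandwidth = 10 * 10**9          # approximate memory bandwidth in bytes/s
print(bytes_total / bandwidth)  # 0.08 seconds, matching the measured runtime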
But from this we also see that the PyPy version has about 0.25 seconds of overhead compared to the C code. Where does it come from? By utilizing the vmprof module (a minimal invocation is sketched after the list below) we can see the jit-code, and:
- for one iteration of the loop, there are many more operations than in the C version
- the versions for sieve and sieve_var are very different. We can use a debugger to count the number of instructions per iteration: 24 for sieve and 76 for sieve_var, which handles only every second element, so the relation is actually 24:38.
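Capturing such a profile with vmprof looks roughly like this (a sketch using vmprof's enable/disable API; the output file name is arbitrary):

import vmprof

# sample the running program, including the generated JIT code;
# the resulting file can then be inspected with vmprofshow or vmprof.com
with open("sieve.prof", "w+b") as fd:
    vmprof.enable(fd.fileno())
    sieve(10**8)
    vmprof.disable()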
It is hard to tell why the jit-code is so different for the two versions without debugging PyPy itself. Probably it is just bad luck that sieve_var ends up with the slower code.