
Similar to why using `True` is slower than using `1` in Python 3, but I'm using PyPy3 and not using the `sum` function.

def sieve_num(n):
    nums = [0] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [i for i in range(2, n) if nums[i] == 0]


def sieve_bool(n):
    nums = [False] * n
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == False:
            for j in range(i*i, n, i):
                nums[j] = True

    return [i for i in range(2, n) if nums[i] == False]

`sieve_num(10**8)` takes 2.55 s, but `sieve_bool(10**8)` takes 4.45 s, which is a noticeable difference.

My suspicion was that `[0]*n` is somehow smaller than `[False]*n` and fits into the cache better, but `sys.getsizeof` and vmprof line profiling are unsupported on PyPy. The only info I could get is that the `<listcomp>` for `sieve_num` took 116 ms (19% of total execution time) while the `<listcomp>` for `sieve_bool` took 450 ms (40% of total execution time).

Using PyPy 7.3.1 (implementing Python 3.6.9) on an Intel i7-7700HQ with 24 GB RAM, on Ubuntu 20.04. Under CPython 3.8.10, `sieve_bool` is only slightly slower.

qwr
  • 2.86s and 3.55s on Python 3.7.13 (PyPy 7.3.9). I guess equality comparisons with the boolean type are more expensive than with integers for some reason - perhaps because there's one more type in the MRO? – wim Feb 19 '23 at 02:11
  • @wim I don't know how Python does equality comparison, but even writing it as `if not nums[i]` still has a performance hit – qwr Feb 19 '23 at 03:09
  • If you're actually searching for performance, there are some cleverly optimized sieves over [here](https://stackoverflow.com/q/2068372/674039). But perhaps you're just curious about the technical reasons for PyPy's relative slowness with the bools. – wim Feb 19 '23 at 04:25
  • My code is fast enough for my purposes (I haven't needed a segmented sieve); I am just interested in the dark arts of PyPy optimization – qwr Feb 19 '23 at 04:42

1 Answer


The reason is that PyPy uses a special implementation for "list of ints that fit in 64 bits". It has a few other special cases, like "list of floats", "list of strings that contain only ASCII", etc. The goal is primarily to save memory: a list of 64-bit integers is stored just like an `array.array('l')`, not as a list of pointers to actual integer objects. You save memory not in the size of the list itself (which doesn't change), but in the fact that you don't need a very large number of small additional integer objects all existing at once.
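
PyPy calls these specialized list representations "strategies", and on PyPy (only) the non-standard `__pypy__` module can show which strategy a given list is using. A minimal sketch, assuming `__pypy__.strategy()` is available in your PyPy version (the exact strategy names can differ between releases):

import __pypy__   # PyPy-only module; this import fails on CPython

ints  = [0] * 10
bools = [False] * 10

print(__pypy__.strategy(ints))    # e.g. "IntegerListStrategy"
print(__pypy__.strategy(bools))   # e.g. "ObjectListStrategy" (no special case for bools)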

There is no special case for "list of boolean", because there are only ever two boolean objects in the first place. So there would be no memory-saving benefit in using a strategy like "list of 64-bit ints" in this case. Of course, we could do better and store that list with only one bit per entry, but it is not a really common pattern in Python; we just never got around to implementing that.
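
As an illustration of what "one bit per entry" would mean, here is a hand-rolled sketch that packs eight flags per byte into a `bytearray`; this only shows the idea, it is not something PyPy or the code above does.

def make_bits(n):
    # n one-bit flags packed into a bytearray, 8 flags per byte
    return bytearray((n + 7) // 8)

def set_bit(bits, i):
    bits[i >> 3] |= 1 << (i & 7)

def test_bit(bits, i):
    return (bits[i >> 3] >> (i & 7)) & 1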

So why is it slower, anyway? The reason is that in the "list of general objects" case, the JIT compiler needs to produce extra code to check the type of the objects every time it reads an item from the list, and extra GC logic every time it puts an item into the list. This is not a lot of code, but in your case, I guess it doubles the length of the (extremely short) generated assembly for the inner loop doing `nums[j] = 1`.

Right now, both in PyPy and CPython(*), the fastest is probably to use `array.array('B')` instead of a list, which both avoids that PyPy-specific issue and also uses substantially less memory (always a performance win if your data structures contain `10**8` elements).
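
A sketch of the sieve rewritten that way, keeping the algorithm from the question and only swapping the flag list for an `array.array('B')` of byte-sized flags:

from array import array

def sieve_array(n):
    nums = array('B', bytes(n))   # n zero bytes, one flag per candidate
    for i in range(2, n):
        if i * i >= n: break
        if nums[i] == 0:
            for j in range(i*i, n, i):
                nums[j] = 1

    return [i for i in range(2, n) if nums[i] == 0]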

EDIT: (*) no, it turns out that CPython is probably too slow for memory bandwidth to be the limit. On my machine, PyPy is maybe 30-35% faster when using bytes. See also the comments for a hack that speeds CPython up from 9x slower than PyPy to 3x slower, but which, as usual, slows down PyPy.

Armin Rigo
  • *"both in PyPy and CPython, the fastest is probably to use `array.array('B')`"* - In CPython, that seems a little slower. `bytearray` seems a bit faster, and `bytearray` with slice assignment (instead of the loop) three times faster: – Kelly Bundy Feb 19 '23 at 10:37
  • [Attempt This Online!](https://ato.pxeger.com/run?1=3VRLTsMwEN2w8ilm18RKUVskhCKFBQfgAlFUpTCmUxonchxEz8KmGzgUd-AO-FM3IXTFBqmWIjnvzbyZN0rm7aPZ6XUt9_v3TovpzefFF1VNrTSUSpU7JlRdgaYK4QDbO2OPKKAlfMGl7KpIxikDc8y9hQzyWQEcpINErYCAJKhSPmG0SCAE20PCkNw8txnIFFYKy-chaQVzKiDLYNZnBd1Nr0ucjHICFP8MC13lGyMCc8cp1J2SkNPJ5sZli6FZN5OxXQdeempyN0ms__hMB7DaaTw5hCFxzr4X_2bcM5zS1NKDuvncfW5blNEwJv6bZ2nHNeP8muFrgw_avN7XEplNX_bpV86TBYUFj7sgGf4pyXh6v4CF92ir2MUShabbbmsxEblWPNookjryYTAF7dFjl4eL6cinO7ZsWzRLKwiGKBbUYr_1DssvLMFv) – Kelly Bundy Feb 19 '23 at 10:38
  • I tested `bytearray` and `array.array('B')` in pypy and the results are about the same. – qwr Feb 19 '23 at 16:52
  • Measured (thanks Kelly for the code), indeed I'm getting the same result as @qwr on pypy. Added a paragraph to the main answer – Armin Rigo Feb 20 '23 at 07:22
  • I hope you mean "hack" in a positive way :-). I think that's a normal thing to do. [Sped up 2x further](https://ato.pxeger.com/run?1=rVNLbsMgFFx0UYlTvCVYjmIni1aW3CP0ApYVOSm0T7XBAlw1Z-kmm_ZQPU3BBOdTL4Nk6WkG5g3jx9dPv7dvSh4O34MVi8ffu3uhVQcWOw7Y9UrbsSYjipZrq1RrIrVTXa-5MamrBmkJeeECDPIPvtnuLW-0bvYrKllBwC05dAZKmBhHjLhQGhBQgm7kK6erFOIJv1A4MnHfUwmygK3mzfs56VUrrKEsITudiv0qTLAoPH3Wt8pr5iRbLun5nuBGcztoCRXO-rruWM_eeT3dudc-yZnm8kRXmbcXytyX2Q1iCXJYX0ZyRGcyyb2pBaWOg4W_6HIJeJFIi8bS-Mfp-MMpS4MkY4Twz57vrBN9VtINjHO_Oblfj9Y9KDx4PSPpvwSDby_n549GJ2ZoPSZoniXJA4shOithm_NuAzrZORaudTg-so0xbpYnwbiLRDUW3sPxWcTn8Qc) now. – Kelly Bundy Feb 20 '23 at 07:47
  • It's using itertools, which I think won't help PyPy, so probably gets CPython down to 1.5x slower than PyPy? I haven't tried optimizing it for PyPy, I'm not familiar enough with it and can't test with a current PyPy version. – Kelly Bundy Feb 20 '23 at 07:54
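
For readers who don't follow the ATO links: the "slice assignment instead of the loop" trick from the comments looks roughly like this (a sketch; the linked code may differ in detail, and per the answer's edit it helps CPython while slowing down PyPy):

def sieve_slices(n):
    # "is prime" flags in a bytearray; the composites for each prime are
    # cleared with a single slice assignment instead of a Python-level loop
    is_prime = bytearray([1]) * n
    for i in range(2, n):
        if i * i >= n: break
        if is_prime[i]:
            is_prime[i*i::i] = bytes(len(range(i*i, n, i)))

    return [i for i in range(2, n) if is_prime[i]]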