
I was writing a simple Sieve of Eratosthenes, in which one produces a list of all primes up to some number N without performing any division or modular arithmetic. Abstractly, my implementation uses an array of N boolean values which all start out False and are eventually flipped to True over the course of the algorithm.

I wanted to know if this would be faster and/or use less memory if I implemented it as a list of 0 and 1, a list of True and False, or a bytearray of 0 and 1.

Timing (python 2.7)

Using python 2.7, I found the following when using N = 10k and N = 3M:

$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_list(3000000)'
10 loops, best of 3: 1.42 sec per loop
$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_byte(3000000)'
10 loops, best of 3: 1.23 sec per loop
$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_list2(3000000)'
10 loops, best of 3: 1.65 sec per loop

$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_list(10000)'
500 loops, best of 3: 3.59 msec per loop
$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_byte(10000)'
500 loops, best of 3: 4.12 msec per loop
$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_list2(10000)'
500 loops, best of 3: 4.25 msec per loop

            10k       3M
byte (01)   4.12 ms   1.23 s
list (01)   3.59 ms   1.42 s
list (TF)   4.25 ms   1.65 s

What surprises me is that for small values of N, the list of integers was the best, and for large values of N, the bytearray was the best. The list of True and False was always slower.

Timing (python 3.3)

I also repeated the test in python 3.3:

$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_list(3000000)'
10 loops, best of 3: 2.05 sec per loop
$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_byte(3000000)' 
10 loops, best of 3: 1.76 sec per loop
$ python -m timeit -s 'import sieve' -n 10 'sieve.soe_list2(3000000)'
10 loops, best of 3: 2.02 sec per loop

$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_list(10000)'
500 loops, best of 3: 5.19 msec per loop
$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_byte(10000)'
500 loops, best of 3: 5.34 msec per loop
$ python -m timeit -s 'import sieve' -n 500 'sieve.soe_list2(10000)'
500 loops, best of 3: 5.16 msec per loop

            10k       3M
byte (01)   5.34 ms   1.76 s
list (01)   5.19 ms   2.05 s
list (TF)   5.16 ms   2.02 s

Here the ordering was the same, with the list better for small N and the bytearray better for large N, but the list of True and False was not significantly different from the list of 1 and 0.

Memory Use

The memory use was exactly the same in python 2.7 and 3.3. I used sys.getsizeof on the list or bytearray, which reported the same size at the beginning and end of the algorithm.

>>> import sieve
>>> sieve.verbose = True
>>> x = sieve.soe_list(10000)
soe_list, 10000 numbers, size = 83120
>>> x = sieve.soe_byte(10000)
soe_byte, 10000 numbers, size = 10993
>>> x = sieve.soe_list2(10000)
soe_list2, 10000 numbers, size = 83120
>>> x = sieve.soe_list(3000000)
soe_list, 3000000 numbers, size = 26791776
>>> x = sieve.soe_byte(3000000)
soe_byte, 3000000 numbers, size = 3138289
>>> x = sieve.soe_list2(3000000)
soe_list2, 3000000 numbers, size = 26791776

            10k       3M
byte (01)   ~11k      ~3.1M
list (01)   ~83k      ~27M
list (TF)   ~83k      ~27M

I was a bit surprised that the large bytearray used more memory than the large list given that the large bytearray was faster.

EDIT: Oops, as pointed out in the comments, I read my own values wrong and interpreted 27M as 2.7M. The list is really much bigger.
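
Those getsizeof numbers are consistent with CPython's layout: on a 64-bit build a list stores an 8-byte pointer per element (here all pointing at the shared small-int or bool singletons, so the pointer array dominates the cost), while a bytearray stores one raw byte per element; both also carry some over-allocation from being grown out of a generator. A rough per-element check, assuming a 64-bit build:

bytes_per_list_slot = 26791776 / 3000001.0   # ~8.9: one 8-byte pointer per element,
                                             #   plus list over-allocation
bytes_per_byte_flag = 3138289 / 3000001.0    # ~1.05: one byte per element,
                                             #   plus bytearray over-allocation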

The Question

Can anyone explain why this algorithm runs faster using a list for small N, and faster using a bytearray for large N?

Test code for reference

sieve.py:

import sys

if sys.version_info.major == 3:
    xrange = range

verbose = False

def soe_byte(upper):
    numbers = bytearray(0 for _ in xrange(0,upper+1))
    if verbose:
        print("soe_byte, {} numbers, size = {}".format(upper, sys.getsizeof(numbers)))
    primes = []
    cur = 2
    while cur <= upper:
        if numbers[cur] == 1:
            cur += 1
            continue
        primes.append(cur)
        for i in xrange(cur,upper+1,cur):
            numbers[i] = 1
    return primes

def soe_list(upper):
    numbers = list(0 for _ in xrange(0,upper+1))
    if verbose:
        print("soe_list, {} numbers, size = {}".format(upper, sys.getsizeof(numbers)))
    primes = []
    cur = 2
    while cur <= upper:
        if numbers[cur] == 1:
            cur += 1
            continue
        primes.append(cur)
        for i in xrange(cur,upper+1,cur):
            numbers[i] = 1
    return primes

def soe_list2(upper):
    numbers = list(False for _ in xrange(0,upper+1))
    if verbose:
        print("soe_list2, {} numbers, size = {}".format(upper, sys.getsizeof(numbers)))
    primes = []
    cur = 2
    while cur <= upper:
        if numbers[cur] == True:
            cur += 1
            continue
        primes.append(cur)
        for i in xrange(cur,upper+1,cur):
            numbers[i] = True
    return primes
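
As a quick sanity check, the three implementations agree on their output; for example:

>>> import sieve
>>> sieve.soe_list(100) == sieve.soe_byte(100) == sieve.soe_list2(100)
True
>>> sieve.soe_byte(100)[:10]
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
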
Eric Appelt
  • You seem to have an error in your size calculation. The memory usage in bytes for the lists should be ~27MB, which is way more than the 3.1MB used for the bytearray. – mata May 30 '15 at 14:09
  • You are right - that is a human error - I read off the values wrong when I made the table. I think maybe that explains it too. Maybe `list` is faster than `bytearray` overall, but I am just getting cache misses when the memory gets big enough? – Eric Appelt May 30 '15 at 14:12
  • The summary of the python 3 times seems to be copied from the python 2.7 section. It doesn't agree with your actual timing outputs. – Paul Rooney May 30 '15 at 14:14
  • Yep, I really messed up the summary tables, which I put to try to make this easier to read - thanks, fixing it now. – Eric Appelt May 30 '15 at 14:17
  • Using `sys.getsizeof` on a list doesn't give a true indication of the amount of memory that the list uses. A Python list object consists of a small amount of "housekeeping" info plus an array of pointers (memory references). `sys.getsizeof` only tells you how much RAM that stuff occupies, it doesn't include the memory used to hold the actual Python objects referenced by those pointers. So if you have a list of 100 ints and replace each of those ints by a bunch of huge objects the list size as returned by `sys.getsizeof` won't change. – PM 2Ring May 30 '15 at 14:46
  • FWIW, here's a link to the C source for the standard [Python list object](http://svn.python.org/projects/python/trunk/Objects/listobject.c). – PM 2Ring May 30 '15 at 14:47
  • Sorry I don't have an answer to your question, but you can speed up your algorithm by replacing those inner `for` loops with slice assignments, as in [this sieve implementation](http://stackoverflow.com/a/26193172/4014959). I just did some timing tests, using `array.arrays` and `bytearrays` instead of lists. When sieving up to 3000000 it didn't make much difference in time, although the plain list version is fastest, since the other versions need to convert lists, and the `array.array` version needs to do that in the loop, since you can't slice-assign to an `array.array` from a list. – PM 2Ring May 30 '15 at 14:54
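
A rough sketch of the slice-assignment idea from that last comment (the name soe_slice and starting the marking at cur*cur are illustrative choices, not taken from the linked answer):

def soe_slice(upper):
    # Same sieve, but the inner Python-level marking loop is replaced by
    # one slice assignment per prime, which runs at C speed.
    numbers = bytearray(upper + 1)   # bytearray(n) starts out all zeros
    primes = []
    for cur in range(2, upper + 1):
        if numbers[cur]:
            continue
        primes.append(cur)
        # Mark every multiple of cur from cur*cur upward; smaller multiples
        # have already been marked by smaller primes.
        numbers[cur*cur::cur] = b'\x01' * len(numbers[cur*cur::cur])
    return primes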

1 Answer


Can anyone explain why this algorithm runs faster using a list for small N, and faster using a bytearray for large N?

This is all very implementation-specific, but you'll commonly see this kind of phenomenon in practice, where using smaller data types performs worse on small inputs and better on large ones.

For example, it's common to see this kind of thing happening if you're using bitwise logic to extract the nth bit (where n is a variable) vs. just working with an array of booleans. Extracting the nth bit from a byte requires more instructions when n is a runtime variable than just setting the whole byte, but the bitset uses less space.
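
A small Python sketch of that trade-off (the names are illustrative), packing eight flags per byte versus spending a whole byte per flag:

n, i = 1000, 42   # illustrative size and index

# One byte per flag: a single index operation to set or test.
flags = bytearray(n)
flags[i] = 1
hit = flags[i]

# One bit per flag: 1/8 the memory, but each access needs extra
# shift/mask work when i is only known at runtime.
bits = bytearray((n + 7) // 8)
bits[i >> 3] |= 1 << (i & 7)            # set bit i
hit = (bits[i >> 3] >> (i & 7)) & 1     # test bit i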

Broadly speaking, the instructions used to access those smaller types are more expensive, but the hardware cache is so much faster than DRAM that the improved spatial locality you get from using smaller types more than makes up for it as your input sizes get larger and larger.

In other words, spatial locality plays an increasingly important role as you hit those bigger inputs, and smaller data types give you more of it (allowing you to fit more adjacent elements into a cache line). You might also get improved temporal locality (more frequently accessing the same adjacent elements in a cache line). So even if the elements require more instructions or more expensive instructions when they've been loaded into registers, that overhead is more than compensated by the fact that now you're accessing more memory from the cache.
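
To put rough numbers on that, assuming a typical 64-byte cache line and 8-byte list slots on a 64-bit CPython build:

CACHE_LINE = 64                          # bytes per cache line (typical, assumed)
LIST_SLOT = 8                            # bytes per list element (a pointer)

flags_per_line_bytearray = CACHE_LINE            # 64 sieve flags per line fetched
flags_per_line_list = CACHE_LINE // LIST_SLOT    # only 8 flags per line fetched

So every cache line pulled in while sieving the bytearray covers about eight times as many flags as one pulled in for the list, which is consistent with the ~3.1M vs ~27M sizes measured in the question.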

Now as to why bytearrays might require more instructions or more expensive instructions than a list of integers, I'm not sure. Those are extremely case-by-case, implementation-specific details. But perhaps in your case, it's loading the nth element of the bytearray into a dword-aligned, dword-sized register and then needing additional instructions and operands to extract and modify the specific byte within that register. This is all speculation unless we know the exact machine instructions emitted by your Python compiler/interpreter for your specific machine. But whatever the case may be, your tests suggest that bytearrays require more expensive instructions to access, but are much more cache-friendly (which more than compensates as you hit those larger inputs).

In any case, you'll see this kind of thing happening a lot with smaller data types versus larger ones. This includes compression: processing carefully designed compressed data (built with attention to hardware details like alignment) can sometimes outperform processing uncompressed data, because the additional work required to decompress is compensated by the improved spatial locality of the compressed data, but only for sufficiently large inputs where memory access starts to play a more critical role.

  • Thanks for the great explanation. It turns out that the hardware I was using has 3MB of L3 cache, and I was testing a 3M element bytearray, so it makes perfect sense that the faster list was just missing the cache. – Eric Appelt May 30 '15 at 16:50