What's the best data structure to use for a sieve (i.e. list of numbers where some get crossed out)?

Question

I am implementing the Sieve of Eratosthenes. I am stuck on the first step: deciding which data structure to use. Put simply, the Sieve of Eratosthenes starts with a list of consecutive numbers (e.g. 2,3,4,5,...99, 100). Then, certain numbers are crossed out and never considered again. What's the best data structure to use? The limit (100 in the example) is not known until run time, though 2 is always the starting number.

I have seen a few different solutions. The limit n is of type mpz_class

bool primes[n]
bool* primes
std::vector<bool> primes
std::bitset<n> primes
vector with the parameter int or unsigned char because they would would be more efficient than bool
for vector and array based solutions, allocate n+1 elements (as seen here) I guess this is so the value of n would be at the last element and fewer subtractions are needed inside a loop.

Some of this information comes from this Code Review question. @Jamal states "there are actually issues with std::vector specifically" but doesn't elaborate. Also I red somewhere that the size of unsigned char is guaranteed to be equal or smaller than bool making it the better choice.

You could try them all and benchmark. They all use `operator[]` to access so a typedef will let you switch type; recompile and test. — Richard Critten, Nov 17 '19 at 12:25
The suggestion from @RichardCritten is the right one. I would add to it that benchmarking would also tell you which data structure is the most efficient for some problem size, because one could be faster for some value of `n`, another for a much higher value of `n` — m.raynal, Nov 17 '19 at 13:17

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

tl;dr meassuring is the only way to be sure, but here are my thougths on this:

In general

bool primes[n] - a simple array of bools - problem is, that is has a fixed size, so it's not usable if n is not known at compile time. Do notice that some compilers will allow VLA's, but that's not the standard and should be avoided.
bool* primes = new bool[n] - also a simple array, just with dynamic allocation.
std::vector<bool> primes - essentially a wrapper around a bool*. May be implemented as space efficient bitset.
std::bitset<n> primes - space efficient storage, but also with fixed size sadly.
std::vector<int> or std::vector<usigned char> - Same as 3), just that it has no optimized implementation as bitset.

If it is a requirement that n is not known until runtime it boils down to some std::vector or dynamic allocated array. If n would have a reasonable upper bound, one could still think about static arrays.

Bitset vs array/vector

In short: Bitsets are optimized for space, essentially storing 8 booleans per byte, but they need bitwise operations to extract information, which takes some cpu-cycles that are not required in an array/vector. See also this question

I'm not sure how much difference the higher information density of a bitset can have on cache performance however.

vector of bool vs vector of unsigned char

If std::vector<bool> is not implemented as space efficient bitset it might take up more memory than a std::vector<unsigned char>, since sizeof bool can be larger than 1 and thus larger than sizeof (unsigned char). (see also here)

If there is a space-efficient implementation std::vector<unsigned char> takes more memory but probably is faster.

Performance of vector

sd std::vector::operator[] does not perform bound checking and thus does not add much performance overhead (if any) to the array access in comparison to a bool*. In addition you have the benefit that some implementations do bounds checking in debug mode (and only in debug mode). std::vector is definitly more comfortable to use than an array.

Conclusion

std::vector<unsigned char> might be the best way if there is sufficient memory available. But in the end benchmarking for different n with optimizations on will be the only way to be sure.

Is there a reason why `std::queue` wouldn't be used instead of `std::vector`? It doesn't make sense to delete or insert elements so I thought queue might be better. I noticed in general, with C++, vector is very popular even when many of it's operators aren't needed. — northerner, Nov 18 '19 at 09:40
`std::queue` is only a wrapper aroung a `std::vector` (or any other STL-container). And yes, vectors are really popular, because they abstract all the memory management. — Lukas-T, Nov 18 '19 at 09:44
`tl;dr meassuring is the only way to be sure,` are the results expected to be similar between different architectures and compilers? — northerner, Jul 25 '20 at 10:41