Performance gap between vector and array

Question

I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.

So I first came up with some code:

int countPrimes(int n) {
    vector<bool> flag(n+1,1);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}

which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:

int countPrimes(int n) {
    // vector<bool> flag(n+1,1);
    bool flag[n+1] ;
    fill(flag,flag+n+1,true);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}

which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both the running time and memory consumption. I have read relative questions like this one and that one but I am still confused.

EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.

How are you compiling your code? What compiler options? Also; `vector` is special. — Jesper Juhl, Apr 18 '19 at 11:45
be carfull, std::vector is a weird specialisation, more a bitset than a vector — Martin Morterol, Apr 18 '19 at 11:49
@Martinm @JesperJuhl Yes. I reduced the running time to 40 ms with 11.5 MB memory after replacing `vector` with `vector`. Thank you. — Qin Heyang, Apr 18 '19 at 11:58
you may gain some time changing one loop a bit: `for(int i = 2; i * i = n` then next loop does nothing. — Marek R, Apr 18 '19 at 12:20
This doesn't address the question, but when you're dealing with boolean types, use `true` and `false` and not `1`. So: `vector flag(n+1, true);` and `if (flag[i])`. That doesn't affect the result, but it makes it much clearer what you're doing. — Pete Becker, Apr 18 '19 at 12:53
If you are not already doing so, make sure to compile your code with compiler optimizations *enabled* before benchmarking. — Jesper Juhl, Apr 18 '19 at 14:22
For which value of `n` did you give us the timing and memory usage? — Michiel uit het Broek, Apr 19 '19 at 09:22

score 29 · Accepted Answer · answered Apr 18 '19 at 11:52

std::vector<bool> isn't like any other vector. The documentation says:

std::vector<bool> is a possibly space-efficient specialization of std::vector for the type bool.

That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.

score 16 · Answer 2 · edited Apr 18 '19 at 12:15

std::vector<bool> is special case. It is specialized template. Each value is stored in single bit, so bit operations are needed. This memory compact but has couple drawbacks (like no way to have a pointer to bool inside this container).

Now bool flag[n+1]; compiler will usually allocate same memory in same manner as for char flag[n+1]; and it will do that on stack, not on heap.

Now depending on page sizes, cache misses and i values one can be faster then other. It is hard to predict (for small n array will be faster, but for larger n result may change).

As an interesting experiment you can change std::vector<bool> to std::vector<char>. In this case you will have similar memory mapping as in case of array, but it will be located at heap not a stack.

Bob__ · Answer 3 · 2019-04-19T09:37:00.703

6

I'd like to add some remarks to the good answers already posted.

The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.

See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).
This: bool flag[n+1]; declares a Variable Length Array, which (despites some performance advantages due to it beeing allocated in the stack) has never been part of the C++ standard, even if provided as an extension by some (C99 compliant) compilers.
Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.

If you can bare the less readable code, you could try to profile the following snippet.

int countPrimes(int n)
{
    if ( n < 2 )
        return 0;
    // Sieve starting from 3 up to n, the number of odd number between 3 and n are
    int sieve_size = n / 2 - 1;
    std::vector<char> sieve(sieve_size); 
    int result = 1;  // 2 is a prime.

    for (int i = 0; i < sieve_size; ++i)
    {
        if ( sieve[i] == 0 )
        {
            // It's a prime, no need to scan the vector again
            ++result;
            // Some ugly transformations are needed, here
            int prime = i * 2 + 3;
            for ( int j = prime * 3, k = prime * 2; j <= n; j += k)
                sieve[j / 2 - 1] = 1;
        }
    }

    return result;
}

Edit

As Peter Cordes noted in the comments, using an unsigned type for the variable j

the compiler can implement j/2 as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that j will always be non-negative.

It's also possible to reduce the number of candidates exploiting the fact that all primes (past 2 and 3) are one below or above a multiple of 6.

edited Apr 19 '19 at 09:37

answered Apr 18 '19 at 14:59

Bob__

12,361
3
28
42

1

Wow, 20 years after C99, and not even the most rudimentary VLAs... Looks like C++ is getting more and more behind C ;-) Afaik, there was once a proposal that would have made `int array[length];` legal in C++, but even that proposal did not allow for multidimensional arrays, neither on the stack (`int array2d[height][width]`) nor on the heap (`int (*array2d)[width] = new int[height][width];`). Apparently, that proposal was never approved, even though C goes as far as to allow the multidimensional case (`int (*array2d)[width] = malloc(height * sizeof(*array2d));`) since 20 years ago... – cmaster - reinstate monica Apr 18 '19 at 15:53
@cmaster Well, that's [debatable](https://stackoverflow.com/questions/1887097/why-arent-variable-length-arrays-part-of-the-c-standard) ;) – Bob__ Apr 18 '19 at 16:02
1

Yeah, I know. Especially since very few people actually understand the power of the C99 syntax - both of the top-scoring answers of the question you linked completely underestimate its value. Because, you see, C99 VLAs not just exist on the stack, they can also be allocated on the heap. I have already given an example of that, it's the last one. The real power comes from the fact that any C99 array type can have a runtime length, it can be pointed to or typedefed. And C99 does not care how many runtime length array types you nest, the indexing calculation comes out right. – cmaster - reinstate monica Apr 18 '19 at 20:09
1

Long story short: I really enjoy working with C++ for the most part, but when it comes to manipulating multidimensional arrays of plain data, (images, volumes, simulations), I'll just fall back to C99 for its VLA syntax. – cmaster - reinstate monica Apr 18 '19 at 20:11
I'd recommend `unsigned j` or `size_t j` to make sure the compiler can implement `j/2` as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that `j` will always be non-negative. It looks like gcc manages with just a `sar` (https://godbolt.org/z/-Q9dod), but MSVC has a `test/jns` to jump over an `add` 1 instruction. – Peter Cordes Apr 19 '19 at 09:03
But yes, wasting half your cache footprint on even numbers is silly, and easier to bake in and optimize for factoring out one small prime than `2` *and* `3`, which is sometimes done. – Peter Cordes Apr 19 '19 at 09:10
@PeterCordes Good points, thanks. Indeed, [gcc](http://quick-bench.com/Obyw2HX5Ay3xt-JUCP7UepAMNfM) seems to manage that optimization better than [others](http://quick-bench.com/65_ZCSXEG0lCVXAJ-vJo7CR077g). – Bob__ Apr 19 '19 at 09:24
1

gcc *does* show a difference there, because it's actually doing worse with signed `int`. Overall it makes faster code, but you can see that it using `sar` / `sub` / `movslq` to sign-extend back to 64-bit. I suggested using `size_t` re-extension to pointer width wouldn't be necessary. (Although zero-extension is free on x86-64; writing a 32-bit reg zero-extends into the 64-bit reg, but 64 lets it use `-1` in the addr mode.) Anyway, `unsigned long` *does* give an extra speedup with gcc better than `unsigned`. http://quick-bench.com/4t9DE9nsT-oQYSNo0leeyYZ8GvY. (`long int` also is the same.) – Peter Cordes Apr 19 '19 at 09:38

score 0 · Answer 4 · answered Apr 19 '19 at 13:02

I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:

vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69

Conclusion: vector<bool> is the fastest option because of its higher IPC (instructions per clock).

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>

size_t countPrimes(size_t n) {
    std::vector<bool> flag(n+1,1);
    //std::vector<char> flag(n+1,1);
    //bool flag[n+1]; std::fill(flag,flag+n+1,true);
    for(size_t i=2;i<n;i++) {
        if(flag[i]==1) {
            for(size_t j=i;i*j<n;j++) {
                flag[i*j]=0;
            }
        }
    }
    size_t result=0;
    for(size_t i=2;i<n;i++) {
        result+=flag[i];
    }
    return result;
}

int main() {
    {
        const rlim_t kStackSize = 16*1024*1024; 
        struct rlimit rl;
        int result = getrlimit(RLIMIT_STACK, &rl);
        if(result != 0) abort();
        if(rl.rlim_cur < kStackSize) {
            rl.rlim_cur = kStackSize;
            result = setrlimit(RLIMIT_STACK, &rl);
            if(result != 0) abort();
        }
    }

    printf("%zu\n", countPrimes(10e6));
    return 0;
}

Performance gap between vector and array

4 Answers4