30

I was trying to solve a coding problem in C++ which counts the number of prime numbers less than a non-negative number n.

So I first came up with some code:

int countPrimes(int n) {
    vector<bool> flag(n+1,1);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}

which takes 88 ms and uses 8.6 MB of memory. Then I changed my code into:

int countPrimes(int n) {
    // vector<bool> flag(n+1,1);
    bool flag[n+1] ;
    fill(flag,flag+n+1,true);
    for(int i =2;i<n;i++)
    {
        if(flag[i]==1)
            for(long j=i;i*j<n;j++)
                flag[i*j]=0;
    }
    int result=0;
    for(int i =2;i<n;i++)
        result+=flag[i];
    return result;
}

which takes 28 ms and 9.9 MB. I don't really understand why there is such a performance gap in both the running time and memory consumption. I have read relative questions like this one and that one but I am still confused.

EDIT: I reduced the running time to 40 ms with 11.5 MB of memory after replacing vector<bool> with vector<char>.

isanae
  • 3,253
  • 1
  • 22
  • 47
Qin Heyang
  • 1,456
  • 1
  • 16
  • 18
  • 6
    How are you compiling your code? What compiler options? Also; `vector` is special. – Jesper Juhl Apr 18 '19 at 11:45
  • 13
    be carfull, std::vector is a weird specialisation, more a bitset than a vector – Martin Morterol Apr 18 '19 at 11:49
  • 2
    @Martinm @JesperJuhl Yes. I reduced the running time to 40 ms with 11.5 MB memory after replacing `vector` with `vector`. Thank you. – Qin Heyang Apr 18 '19 at 11:58
  • 3
    you may gain some time changing one loop a bit: `for(int i = 2; i * i = n` then next loop does nothing. – Marek R Apr 18 '19 at 12:20
  • 3
    This doesn't address the question, but when you're dealing with boolean types, use `true` and `false` and not `1`. So: `vector flag(n+1, true);` and `if (flag[i])`. That doesn't affect the result, but it makes it much clearer what you're doing. – Pete Becker Apr 18 '19 at 12:53
  • 4
    If you are not already doing so, make sure to compile your code with compiler optimizations *enabled* before benchmarking. – Jesper Juhl Apr 18 '19 at 14:22
  • 2
    For which value of `n` did you give us the timing and memory usage? – Michiel uit het Broek Apr 19 '19 at 09:22

4 Answers4

29

std::vector<bool> isn't like any other vector. The documentation says:

std::vector<bool> is a possibly space-efficient specialization of std::vector for the type bool.

That's why it may use up less memory than an array, because it might represent multiple boolean values with one byte, like a bitset. It also explains the performance difference, since accessing it isn't as simple anymore. According to the documentation, it doesn't even have to store it as a contiguous array.

Blaze
  • 16,736
  • 2
  • 25
  • 44
16

std::vector<bool> is special case. It is specialized template. Each value is stored in single bit, so bit operations are needed. This memory compact but has couple drawbacks (like no way to have a pointer to bool inside this container).

Now bool flag[n+1]; compiler will usually allocate same memory in same manner as for char flag[n+1]; and it will do that on stack, not on heap.

Now depending on page sizes, cache misses and i values one can be faster then other. It is hard to predict (for small n array will be faster, but for larger n result may change).

As an interesting experiment you can change std::vector<bool> to std::vector<char>. In this case you will have similar memory mapping as in case of array, but it will be located at heap not a stack.

Jesper Juhl
  • 30,449
  • 3
  • 47
  • 70
Marek R
  • 32,568
  • 6
  • 55
  • 140
6

I'd like to add some remarks to the good answers already posted.

  • The performance differences between std::vector<bool> and std::vector<char> may vary (a lot) between different library implementations and different sizes of the vectors.

    See e.g. those quick benches: clang++ / libc++(LLVM) vs. g++ / libstdc++(GNU).

  • This: bool flag[n+1]; declares a Variable Length Array, which (despites some performance advantages due to it beeing allocated in the stack) has never been part of the C++ standard, even if provided as an extension by some (C99 compliant) compilers.

  • Another way to increase the performances could be to reduce the amount of calculations (and memory occupation) by considering only the odd numbers, given that all the primes except for 2 are odd.

If you can bare the less readable code, you could try to profile the following snippet.

int countPrimes(int n)
{
    if ( n < 2 )
        return 0;
    // Sieve starting from 3 up to n, the number of odd number between 3 and n are
    int sieve_size = n / 2 - 1;
    std::vector<char> sieve(sieve_size); 
    int result = 1;  // 2 is a prime.

    for (int i = 0; i < sieve_size; ++i)
    {
        if ( sieve[i] == 0 )
        {
            // It's a prime, no need to scan the vector again
            ++result;
            // Some ugly transformations are needed, here
            int prime = i * 2 + 3;
            for ( int j = prime * 3, k = prime * 2; j <= n; j += k)
                sieve[j / 2 - 1] = 1;
        }
    }

    return result;
}

Edit

As Peter Cordes noted in the comments, using an unsigned type for the variable j

the compiler can implement j/2 as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that j will always be non-negative.

It's also possible to reduce the number of candidates exploiting the fact that all primes (past 2 and 3) are one below or above a multiple of 6.

Bob__
  • 12,361
  • 3
  • 28
  • 42
  • 1
    Wow, 20 years after C99, and not even the most rudimentary VLAs... Looks like C++ is getting more and more behind C ;-) Afaik, there was once a proposal that would have made `int array[length];` legal in C++, but even that proposal did not allow for multidimensional arrays, neither on the stack (`int array2d[height][width]`) nor on the heap (`int (*array2d)[width] = new int[height][width];`). Apparently, that proposal was never approved, even though C goes as far as to allow the multidimensional case (`int (*array2d)[width] = malloc(height * sizeof(*array2d));`) since 20 years ago... – cmaster - reinstate monica Apr 18 '19 at 15:53
  • @cmaster Well, that's [debatable](https://stackoverflow.com/questions/1887097/why-arent-variable-length-arrays-part-of-the-c-standard) ;) – Bob__ Apr 18 '19 at 16:02
  • 1
    Yeah, I know. Especially since very few people actually understand the power of the C99 syntax - both of the top-scoring answers of the question you linked completely underestimate its value. Because, you see, C99 VLAs not just exist on the stack, they can also be allocated on the heap. I have already given an example of that, it's the last one. The real power comes from the fact that any C99 array type can have a runtime length, it can be pointed to or typedefed. And C99 does not care how many runtime length array types you nest, the indexing calculation comes out right. – cmaster - reinstate monica Apr 18 '19 at 20:09
  • 1
    Long story short: I really enjoy working with C++ for the most part, but when it comes to manipulating multidimensional arrays of plain data, (images, volumes, simulations), I'll just fall back to C99 for its VLA syntax. – cmaster - reinstate monica Apr 18 '19 at 20:11
  • I'd recommend `unsigned j` or `size_t j` to make sure the compiler can implement `j/2` as cheaply as possible. C signed division by a power of 2 has different rounding semantics (for negative dividends) than a right shift, and compilers don't always propagate value-range proofs sufficiently to prove that `j` will always be non-negative. It looks like gcc manages with just a `sar` (https://godbolt.org/z/-Q9dod), but MSVC has a `test/jns` to jump over an `add` 1 instruction. – Peter Cordes Apr 19 '19 at 09:03
  • But yes, wasting half your cache footprint on even numbers is silly, and easier to bake in and optimize for factoring out one small prime than `2` *and* `3`, which is sometimes done. – Peter Cordes Apr 19 '19 at 09:10
  • @PeterCordes Good points, thanks. Indeed, [gcc](http://quick-bench.com/Obyw2HX5Ay3xt-JUCP7UepAMNfM) seems to manage that optimization better than [others](http://quick-bench.com/65_ZCSXEG0lCVXAJ-vJo7CR077g). – Bob__ Apr 19 '19 at 09:24
  • 1
    gcc *does* show a difference there, because it's actually doing worse with signed `int`. Overall it makes faster code, but you can see that it using `sar` / `sub` / `movslq` to sign-extend back to 64-bit. I suggested using `size_t` re-extension to pointer width wouldn't be necessary. (Although zero-extension is free on x86-64; writing a 32-bit reg zero-extends into the 64-bit reg, but 64 lets it use `-1` in the addr mode.) Anyway, `unsigned long` *does* give an extra speedup with gcc better than `unsigned`. http://quick-bench.com/4t9DE9nsT-oQYSNo0leeyYZ8GvY. (`long int` also is the same.) – Peter Cordes Apr 19 '19 at 09:38
0

I am getting different timings and memory usage than the ones mentioned in the question when compiling with g++-7.4.0 -g -march=native -O2 -Wall and running on a Ryzen 5 1600 CPU:

  • vector<bool>: 0.038 seconds, 3344 KiB memory, IPC 3.16
  • vector<char>: 0.048 seconds, 12004 KiB memory, IPC 1.52
  • bool[N]: 0.050 seconds, 12644 KiB memory, IPC 1.69

Conclusion: vector<bool> is the fastest option because of its higher IPC (instructions per clock).


#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <vector>

size_t countPrimes(size_t n) {
    std::vector<bool> flag(n+1,1);
    //std::vector<char> flag(n+1,1);
    //bool flag[n+1]; std::fill(flag,flag+n+1,true);
    for(size_t i=2;i<n;i++) {
        if(flag[i]==1) {
            for(size_t j=i;i*j<n;j++) {
                flag[i*j]=0;
            }
        }
    }
    size_t result=0;
    for(size_t i=2;i<n;i++) {
        result+=flag[i];
    }
    return result;
}

int main() {
    {
        const rlim_t kStackSize = 16*1024*1024; 
        struct rlimit rl;
        int result = getrlimit(RLIMIT_STACK, &rl);
        if(result != 0) abort();
        if(rl.rlim_cur < kStackSize) {
            rl.rlim_cur = kStackSize;
            result = setrlimit(RLIMIT_STACK, &rl);
            if(result != 0) abort();
        }
    }

    printf("%zu\n", countPrimes(10e6));
    return 0;
}
atomsymbol
  • 370
  • 8
  • 11