
The problem statement is to find the prime numbers below 2 billion in a time frame of under 20 seconds. I tried the following approaches.

  1. Divide each candidate number n by every number k with k <= sqrt(n) - took 20 sec

  2. Divide each candidate number n by the list of primes below sqrt(n). In this scenario I stored the primes in a std::list - took more than 180 sec

Can someone help me understand why the 2nd approach took so much longer, even though it reduces the number of divisions by roughly 50%? Or did I choose the wrong data structure?

Approach 1:

#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;

list<long long> primeno;
void ListPrimeNumber();

int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    return 0;
}

void check_prime(int i);

void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 20000000; i++)
    {
        check_prime(i);
    }

}

void check_prime(int i)
{
    try
    {
        int j = 0;
        int limit = sqrt(i);          // trial divisors only need to go up to sqrt(i)
        for (j = 2; j <= limit; j++)
        {
            if (i % j == 0)           // j divides i, so i is composite
            {
                break;
            }
        }
        if (j > limit)                // loop ran to the end: no divisor found, i is prime
        {
            primeno.push_back(i);
        }
    }

    catch (exception ex)
    {
        std::cout << "Message";
    }
}

Approach 2:

#include <iostream>
#include <list>
#include <cmath>   // for sqrt
#include <ctime>
using namespace std;

list<long long> primeno;
int noofdiv = 0;
void ListPrimeNumber();

int main()
{
    clock_t time_req = clock();
    ListPrimeNumber();
    time_req = clock() - time_req;
    cout << "time taken " << static_cast<float>(time_req) / CLOCKS_PER_SEC << " seconds" << endl;
    cout << "No of divisions : " << noofdiv;
    return 0;
}

void check_prime(int i);

void ListPrimeNumber()
{
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    for (long long i = 6; i <= 10000; i++)
    {
        check_prime(i);
    }

}

void check_prime(int i)
{
    try
    {
        int limit = sqrt(i);
        for (int iter : primeno)
        {
            noofdiv++;
            if (iter <= limit && (i % iter) == 0)
            {
                break;                    // a stored prime divides i, so i is composite
            }
            else if (iter > limit)
            {
                primeno.push_back(i);     // no stored prime up to sqrt(i) divides i, so i is prime
                break;
            }
        }
        }
    }

    catch (exception ex)
    {
        std::cout << "Message";
    }
}
Apoorva Raju
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/189623/discussion-on-question-by-apoorva-raju-prime-number-below-2-billion-usage-of-s). – Samuel Liew Mar 07 '19 at 22:42
  • to whom it may concern: **[a test at ideone](https://ideone.com/VPOnux) completely contradicts the claims made in this question.** – Will Ness Mar 09 '19 at 08:36
  • Approach 3: get the list online :) (for example here: http://www.primos.mat.br/2T_en.html) – fbparis Mar 11 '19 at 10:11
  • @fbparis they are not asking for advice on how to make it faster. they are asking why it is slower -- but it isn't, according to the test on ideone. quite the opposite, it is faster, just as expected. – Will Ness Mar 11 '19 at 10:32

3 Answers


The reason your second example takes longer is that you're iterating a std::list.

A std::list in C++ is a linked list, which means it doesn't use contiguous memory. That is bad, because to iterate the list you must jump from node to node in a way that is unpredictable to the CPU and its prefetcher, and you're most likely only "using" a few bytes of each cache line. RAM is slow: fetching a byte from RAM takes a lot longer than fetching it from L1, and CPUs are so fast these days that your program spends most of its time doing nothing, just waiting for memory to arrive.

Use a std::vector instead. It stores all values contiguously, one after the other, so iterating is very cheap: you walk forward through memory without jumping, you use each cache line in full, and the prefetcher can fetch the next pages before you need them because your memory access pattern is predictable.

It has been demonstrated by numerous people, including Bjarne Stroustrup, that std::vector is in a lot of cases faster than std::list, even in cases where std::list has "theoretically" better complexity (random insertion, deletion, ...), simply because caching helps so much. So always use std::vector as your default; and if you think a linked list would be faster in your case, measure it and be surprised that, most of the time, std::vector wins.
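
For illustration, a minimal sketch of what approach 2 could look like with the container swapped to std::vector (the same trial division by stored primes, only the data structure changes; the names and the reserve figure are mine, not the OP's):

#include <cmath>
#include <iostream>
#include <vector>

int main()
{
    std::vector<long long> primes{2, 3, 5};
    primes.reserve(1300000); // roughly 1.27 million primes below 20 million, so no reallocations

    for (long long i = 6; i <= 20000000; ++i)
    {
        long long limit = static_cast<long long>(std::sqrt(static_cast<double>(i)));
        bool is_prime = true;
        for (long long p : primes)
        {
            if (p > limit)
                break;            // no stored prime up to sqrt(i) divides i
            if (i % p == 0)
            {
                is_prime = false; // p divides i, so i is composite
                break;
            }
        }
        if (is_prime)
            primes.push_back(i);
    }
    std::cout << "primes found: " << primes.size() << '\n';
}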

Edit: as others have noted, your method of finding primes isn't very efficient. I just played around a bit and implemented a Sieve of Eratosthenes using a bitset.

// Needs #include <bitset> and #include <cmath>.
constexpr int max_prime = 1000000000;
// ~125 MB of bits, so allocate on the heap rather than the stack.
std::bitset<max_prime> *bitset = new std::bitset<max_prime>{};
// Note: Bit SET means NO prime
bitset->set(0);
bitset->set(1);

for(int i = 4; i < max_prime; i += 2)
    bitset->set(i); // set all even numbers
int max = std::sqrt(max_prime);
for(int i = 3; i < max; i += 2) { // No point testing even numbers as they can't be prime
    if(!bitset->test(i)) { // If i is prime
        for(int j = i * 2; j < max_prime; j += i)
            bitset->set(j); // set all multiples of i to non-prime
    }
}

This takes ~~between 4.2 and 4.5 seconds~~ 30 seconds (not sure why it changed that much after slight modifications... must be an optimization I'm not hitting anymore) to find all primes below one billion (1,000,000,000) on my machine. Your approach took way too long even for 100 million; I cancelled the 1 billion search after about two minutes.

Comparison for 100 million:

time taken:                63.515 seconds
time taken bitset:         1.874 seconds
No of divisions :          1975961174
No of primes found:        5761455
No of primes found bitset: 5761455

I'm not a mathematician, so I'm pretty sure there are still ways to optimize this further; the only optimization I do is skipping even numbers.
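
The prime counts in the comparison above can be read back out of the bitset roughly like this (a sketch continuing the snippet above, not necessarily the exact code that produced those numbers):

// A bit that is NOT set marks a prime, so count the clear bits from 2 upwards.
long long primes_found = 0;
for (int i = 2; i < max_prime; ++i)
    if (!bitset->test(i))
        ++primes_found;
std::cout << "No of primes found bitset: " << primes_found << '\n';

Equivalently, max_prime - bitset->count() gives the same number, since every non-prime index (including 0 and 1) has its bit set.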

tkausl
  • @tkausl Thanks a ton for the explanation... your solution helped me to reduce execution time from 180 sec to 22 sec... will try to optimize more by exploring the options given by Bathsheba and increase the performance – Apoorva Raju Mar 06 '19 at 12:09
  • @ApoorvaRaju your timings are so astronomical that there must be something else going on with your *algorithm*. normally the time for this should be in milliseconds, or less. – Will Ness Mar 06 '19 at 13:53
  • @ApoorvaRaju I've added a sieve implementation which should be a good starting point for further improvements. – tkausl Mar 06 '19 at 14:34
  • FYI the code from [this answer of mine](https://stackoverflow.com/a/9557173/849891) takes 0 seconds on ideone to calculate 10 primes above 2 billion. Should be straightforward to amend to finding the topmost of n primes *below* a given value (or just finding the 1st one below, but that would mean counting backwards and adjusting the logic for that, which could be a little bit trickier). – Will Ness Mar 06 '19 at 17:19
  • @tkausl the reason it takes so much time is that you calculate approximately 100 million primes, below the 2 billion. you don't need that; about 4650 primes, below the `sqrt` of 2 billion, is enough. – Will Ness Mar 06 '19 at 17:31
  • @WillNess You're right, not sure how I missed that. Also, thanks for linking your algorithm; I don't understand it yet but I'll certainly study it. – tkausl Mar 06 '19 at 17:41
  • just a sieve of Eratosthenes up to the `sqrt(top + delta)`, hopscotching to the separate *"offset"* segment from `top` to `(top + delta)`. – Will Ness Mar 06 '19 at 17:42
  • FYI [a test at ideone](https://ideone.com/VPOnux) completely contradicts the claims made in this question. – Will Ness Mar 06 '19 at 19:08

The first thing to do is to make sure you are compiling with optimisations enabled. The C++ standard library template classes tend to perform very poorly with unoptimised code because they generate lots of function calls; the optimiser inlines most of these calls, which makes them much cheaper.
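
For example, assuming GCC or Clang and a source file named prime.cpp (both just placeholders here):

g++ -O2 prime.cpp -o prime    # or -O3; with MSVC the rough equivalent is cl /O2 prime.cpp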

std::list is a linked list. It is mostly useful where you want to insert or remove elements at arbitrary positions (i.e. not just at the end).

For the case where you are only appending to the end of a list std::list has the following issues:

  • Iterating through the list is relatively expensive as the code has to follow node pointers and then retrieve the data
  • The list uses quite a lot more memory: each element needs pointers to the previous and next nodes in addition to the actual data. On a 64-bit system this equates to 20 bytes per element for a list of int, rather than the 4 bytes a vector element would need
  • As the elements in the list are not contiguous in memory the compiler can't perform as many SIMD optimisations and you will suffer more from CPU cache misses

A std::vector would solve all of the above as its memory is contiguous and iterating through it is basically just a case of incrementing an array index. You do need to make sure that you call reserve on your vector at the beginning with a sufficiently large value so that appending to the vector doesn't cause the whole array to be copied to a new larger array.
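
A minimal sketch of that change, reusing the OP's names (the reserve figure is my own guess, sized for the roughly 1.27 million primes below 20 million):

#include <vector>

std::vector<long long> primeno;

void ListPrimeNumber()
{
    // Reserve enough room up front so push_back never has to reallocate and copy.
    primeno.reserve(1300000);
    primeno.push_back(2);
    primeno.push_back(3);
    primeno.push_back(5);
    // ... same trial-division loop as before, pushing each prime found
}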

A bigger optimisation than the above would be to use the Sieve of Eratosthenes to calculate your primes. As generating this might require random deletions (depending on your exact implementation), std::list might perform better than std::vector, though even then the benefits of std::list might not outweigh its overheads.

Alan Birtles
  • `std::list is a linked list. Its is mostly useful where you want to insert or remove elements randomly (i.e. not from the end).` even in this case _one needs to measure performance_. `std::vector` is often even in these cases faster. – tkausl Mar 06 '19 at 11:59
  • FYI [a test at ideone](https://ideone.com/VPOnux) completely contradicts the claims made in this question. – Will Ness Mar 06 '19 at 19:11
  • @WillNess that's why the first part of my answer was to enable optimisations as that seemed the most likely cause of slowness – Alan Birtles Mar 07 '19 at 07:29
  • if it's unrepeatable it didn't happen, is what scientific method tells us. the question is, could you repeat the observations in the question with whichever settings? my best guess still remains erroneous testing, because algorithms always win over optimizations, and there's a clear algorithmic advantage for "list" over "numbers" algorithm. – Will Ness Mar 07 '19 at 10:00

[A test at Ideone](https://ideone.com/VPOnux) (the OP's code with a few superficial alterations) completely contradicts the claims made in this question:

/* check_prime__list:

              time taken       No of divisions         No of primes
    10M:    0.873 seconds        286144936                664579
    20M:    2.169 seconds        721544444               1270607   */

    2B:       projected time: at least 16 minutes but likely much more   (*)

/* check_prime__nums:

              time taken       No of divisions         No of primes
    10M:    4.650 seconds       1746210131                664579
    20M:   12.585 seconds       4677014576               1270607   */

    2B:       projected time: at least 3 hours but likely much more      (*)

I also changed the type of the number-of-divisions counter to long int, because the int counter was wrapping around its limit. So the OP could have been misinterpreting that, too.

But the run time wasn't being affected by that. A wall clock is a wall clock.

The most likely explanation seems to be sloppy testing by the OP, with different values used in each test case, by mistake.


(*) The time projections were made by empirical orders of growth analysis:

100**1.32 * 2.169 / 60 = 15.8

100**1.45 * 12.585 / 3600 = 2.8

Empirical orders of growth, as measured on the given range of sizes, were noticeably better for the list algorithm, n^1.32 vs. n^1.45 for the testing by all numbers. This is entirely expected from theoretical complexity, since there are fewer primes than all numbers up to n, by a factor of log n, for a total complexity of O(n^1.5 / log n) vs. O(n^1.5). It is also highly unlikely for any implementational discrepancy to beat an actual algorithmic advantage.
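
(How the measured exponents are obtained, roughly: each one is the log of the ratio of the two run times divided by the log of the ratio of the two problem sizes, which double from 10M to 20M.)

log(2.169 / 0.873)  / log(2)  ≈  1.3    for check_prime__list
log(12.585 / 4.650) / log(2)  ≈  1.4    for check_prime__nums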

Will Ness