
I'm trying to implement a parallel Sieve of Eratosthenes with Pthreads. I have finished my coding and the program works correctly and as expected: if I use more than 1 thread, the computation time is lower than with the sequential program (only 1 thread). However, no matter how many extra threads I use, the computation time stays basically the same. For example, when sieving from 1 to 1 billion, the sequential program takes about 21 seconds and the parallel program with 2 threads takes about 14 seconds. But it still takes about 14 seconds with 3, 4, 5, 10, 20, or 50 threads, as I tried. I want to know what causes this and how to solve it. My code is listed below:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>  /* for time() and time_t */
//The group of arguments passed to thread
struct thrd_data{
  long id;
  long start;
  long end; /* the sub-range is from start to end */
};
typedef struct {
  pthread_mutex_t     count_lock;     /* mutex semaphore for the barrier */
  pthread_cond_t      ok_to_proceed;  /* condition variable for leaving */
  long                count;      /* count of the number of threads that have arrived */
} mylib_barrier_t;

//global variable
bool *GlobalList;//The list of natural numbers: GlobalList[i] is set when i+1 is composite
long Num_Threads;
mylib_barrier_t barrier;/* barrier */

void mylib_barrier_init(mylib_barrier_t *b)
{
  b -> count = 0;
  pthread_mutex_init(&(b -> count_lock), NULL);
  pthread_cond_init(&(b -> ok_to_proceed), NULL);
}

void mylib_barrier(mylib_barrier_t *b, long id) 
{
   pthread_mutex_lock(&(b -> count_lock));
   b -> count ++;
   if (b -> count == Num_Threads)
   {
     b -> count = 0; /* must be reset for future re-use */
     pthread_cond_broadcast(&(b -> ok_to_proceed));
   }
   else
   {
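     /* Note: this wait loop only retries when pthread_cond_wait reports an
        error; it does not re-check a shared predicate, so a spurious wakeup
        would let the thread fall through. */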
     while (pthread_cond_wait(&(b -> ok_to_proceed), &(b -> count_lock)) != 0);
   }
   pthread_mutex_unlock(&(b -> count_lock));
}

void mylib_barrier_destroy(mylib_barrier_t *b) 
{
  pthread_mutex_destroy(&(b -> count_lock));
  pthread_cond_destroy(&(b -> ok_to_proceed));
}

void *DoSieve(void *thrd_arg)
{

  struct thrd_data *t_data;
  long i,start, end;
  long k=2;//The current prime number in first loop
  long myid;

  /* Initialize my part of the global array */
  t_data = (struct thrd_data *) thrd_arg;
  myid = t_data->id;
  start = t_data->start;
  end = t_data->end;

  printf ("Thread %ld doing look-up from %ld to %ld\n", myid,start,end);
  //First loop: iterate over every prime k with k*k <= end
  while (k*k<=end) 
  {
      int flag;
      if(k*k>=start)
        flag=0;
      else
        flag=1;
      //Second loop: mark all multiples of current prime number
      for (i = !flag? k*k-1:start+k-start%k-1; i <= end; i += k)
        GlobalList[i] = 1;
      i=k;
      //wait for other threads to finish the second loop for the current prime number
      mylib_barrier(&barrier,myid);
      //find next prime number that's greater than current one
      while (GlobalList[i] == 1)
        i++;
      k = i+1;

   }
  //decrement the counter of threads before exit
  pthread_mutex_lock (&barrier.count_lock);
  Num_Threads--;
  if (barrier.count == Num_Threads)
  {
    barrier.count = 0;  /* must be reset for future re-use */
    pthread_cond_broadcast(&(barrier.ok_to_proceed));
  }
  pthread_mutex_unlock (&barrier.count_lock);
  pthread_exit(NULL);
}


int main(int argc, char *argv[])
{
  long i, n,n_threads;
  long k, nq, nr;
  FILE *results;
  struct thrd_data *t_arg;
  pthread_t *thread_id;
  pthread_attr_t attr;

  /* Pthreads setup: initialize barrier and explicitly create
   threads in a joinable state (for portability)  */
  mylib_barrier_init(&barrier);
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

  /* ask to enter n and n_threads from the user */
  printf ("enter the range n = ");
  scanf ("%ld", &n);
  printf ("enter the number of threads n_threads = ");
  scanf ("%ld", &n_threads);
  time_t start = time(0);//set initial time
  //Initialize global list
  GlobalList=(bool *)malloc(sizeof(bool)*(n+1));/* n+1 entries: the marking loop can touch index end (== n) in the last thread */
  for(i=0;i<=n;i++)
    GlobalList[i]=0;
  /* create arrays of thread ids and thread args */
  thread_id = (pthread_t *)malloc(sizeof(pthread_t)*n_threads);
  t_arg = (struct thrd_data *)malloc(sizeof(struct thrd_data)*n_threads);

  /* distribute load and create threads for computation */
  nq = n / n_threads;
  nr = n % n_threads;
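  /* nq is the base chunk size; the first nr threads each get one extra
     element so that the whole range 1..n is covered exactly. */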

  k = 1;
  Num_Threads=n_threads;
  for (i=0; i<n_threads; i++){
    t_arg[i].id = i;
    t_arg[i].start = k;
    if (i < nr)
        k = k + nq + 1;
    else
        k = k + nq;
    t_arg[i].end = k-1;
    pthread_create(&thread_id[i], &attr, DoSieve, (void *) &t_arg[i]);
  }

  /* Wait for all threads to complete then print all prime numbers */
  for (i=0; i<n_threads; i++) {
    pthread_join(thread_id[i], NULL);
  }
  int j=1;
  //Get the time spent on the computation by all participating threads
  time_t stop = time(0);
  printf("Time to do everything except print = %lu seconds\n", (unsigned   long)    (stop-start));
  //print the result of prime numbers
  printf("The prime numbers are listed below:\n");
  for (i = 1; i < n; i++)
  {
    if (GlobalList[i] == 0)
    {
        printf("%ld ", i + 1);
        j++;
    }
    if (j% 15 == 0)
        printf("\n");
  }
  printf("\n");
  // Clean up and exit 
  free(GlobalList);
  pthread_attr_destroy(&attr);
  mylib_barrier_destroy(&barrier); // destroy barrier object
  pthread_exit (NULL);
}
Ivan
  • @EOF The only shared variable in my program is the global array, but as each thread only calculates a specified sub-range of it that does not overlap with the others', there shouldn't be the concurrency issue you are concerned about. – Ivan May 24 '16 at 12:45
  • @EOF Could you explain it in more detail? I don't see what could lead to the situation you are concerned about. – Ivan May 25 '16 at 03:00
  • It would help if you documented what data a mutex is expected to protect. Also, what possible issues that are common in parallelizing work have you ruled out already? – Ulrich Eckhardt May 27 '16 at 07:08

2 Answers


You make a valid observation. More threads don't mean more work gets done.

You are running your program on a dual-core CPU. You already saturate the system with 2 threads.

With 1 thread only 1 core will get used. With 2 threads 2 cores will get used. With, say, 4 threads you will see about the same performance as with 2 threads. Hyper-threading doesn't help because a logical core (HT core) shares the memory system with its physical core.
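
If you want the program to pick a sensible thread count by itself, you can query the number of online processors at run time and size the thread pool accordingly. A minimal sketch, assuming a POSIX-like system where sysconf exposes _SC_NPROCESSORS_ONLN (a common but non-standard name on Linux and most Unix-likes):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of processors currently online; fall back to 1 on error. */
    long n_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (n_cores < 1)
        n_cores = 1;

    /* For a CPU-bound sieve, more worker threads than cores rarely helps. */
    printf("suggested thread count: %ld\n", n_cores);
    return 0;
}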

Here is the output of running

perf stat -d sieve

      23879.553188      task-clock (msec)         #    1.191 CPUs utilized          
             3,666      context-switches          #    0.154 K/sec                  
             1,470      cpu-migrations            #    0.062 K/sec                  
           219,177      page-faults               #    0.009 M/sec                  
    76,070,790,848      cycles                    #    3.186 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    34,500,622,236      instructions              #    0.45  insns per cycle        
     4,172,395,541      branches                  #  174.727 M/sec                  
         1,020,010      branch-misses             #    0.02% of all branches        
    21,065,385,093      L1-dcache-loads           #  882.152 M/sec                  
     1,223,920,596      L1-dcache-load-misses     #    5.81% of all L1-dcache hits  
        69,357,488      LLC-loads                 #    2.904 M/sec                  
   <not supported>      LLC-load-misses:HG  

This is the output of an i5-4460 CPU's hardware performance monitor. It tracks some interesting statistics.

Notice the low instructions-per-cycle count. The CPU is doing 0.45 instructions per cycle. Normally you want to see this value > 1.

Update: The key thing to notice here is that increasing the number of threads doesn't help. The CPU can only do a finite amount of branching and memory access per unit of time.

Alen Vrečko
  • correlation does not imply causation - one would have to inspect his actual program in its entirety in order to pinpoint the cause .... low instruction-rate does not necessarily mean high memory-usage – specializt May 24 '16 at 10:29
  • So can I say that the optimal number of threads to run my program on a certain machine should be exactly the number of cores that machine has? – Ivan May 24 '16 at 13:55
  • ... certainly **not** but it's a good starting point - the real optimal number of threads depends on a huge number of factors like CPU cache size, operating system load, used memory ... – specializt May 24 '16 at 14:22

Two observations.

First, if you fix your sieve code then it should run about 25 times as fast as it does now, corresponding roughly to the expected gain from distributing your current code successfully over 32 cores.

Have a look at prime number summing still slow after using sieve where I showed how to sieve the numbers up to 2,000,000,000 in 1.25 seconds in C# of all languages. The article discusses (and benchmarks) each step/technique separately, so that you can pick what you like and roll a solution that strikes the perfect bang/buck ratio for your needs. Things will be even faster in C/C++ because there you can count on the compiler sweating the small stuff for you (at least with excellent compilers like gcc or VC++).

Second: when sieving large ranges the most important resource is the level 1 cache of the CPU. Everything else plays second fiddle. You can see this also from the benchmarks in my article. To distribute a sieving task across several CPUs, count the L1 caches in your system and assign a sieving job to each cache (1/kth of the range where k is the number of L1 caches). This is a bit of a simplification since you'd normally choose a finer granularity for the size of the work items, but it gives the general idea.
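
To make the cache argument concrete, here is a single-threaded sketch (not your code) of the segmented approach: sieve one L1-sized block at a time so the working set stays cache-resident. The 32 KB segment size and the helper names are illustrative assumptions:

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEGMENT (32 * 1024)   /* roughly the size of a typical L1 data cache */

/* Count the primes in [2, n] by sieving one L1-sized segment at a time. */
long segmented_sieve_count(long n)
{
    long limit = (long)sqrt((double)n) + 1;

    /* Step 1: an ordinary sieve for the small primes up to sqrt(n). */
    char *small = calloc(limit + 1, 1);
    long *primes = malloc((limit + 1) * sizeof(long));
    long n_primes = 0;
    for (long i = 2; i <= limit; i++) {
        if (!small[i]) {
            primes[n_primes++] = i;
            for (long j = i * i; j <= limit; j += i)
                small[j] = 1;
        }
    }

    /* Step 2: sieve [low, high) block by block; each block fits in L1. */
    char seg[SEGMENT];
    long count = 0;
    for (long low = 2; low <= n; low += SEGMENT) {
        long high = low + SEGMENT;
        if (high > n + 1)
            high = n + 1;
        memset(seg, 0, sizeof seg);

        for (long p = 0; p < n_primes; p++) {
            long k = primes[p];
            if (k * k >= high)
                break;
            long first = (low + k - 1) / k * k;   /* first multiple of k >= low */
            if (first < k * k)
                first = k * k;                    /* never cross off k itself */
            for (long j = first; j < high; j += k)
                seg[j - low] = 1;                 /* j is composite */
        }
        for (long j = low; j < high; j++)
            if (!seg[j - low])
                count++;
    }
    free(small);
    free(primes);
    return count;
}

int main(void)
{
    /* pi(10^6) should come out as 78498 */
    printf("primes up to 1000000: %ld\n", segmented_sieve_count(1000000));
    return 0;
}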

I said 'caches', not 'cores', 'virtual cores' or 'threads' because that is precisely what you need to do: assign the jobs such that each job has its own L1 cache. How that works depends not only on the operating system but also on the specific CPU(s) in your system. If two 'whatevers' share an L1 cache, give a job to only one of the two and ignore the other (or rather, set the affinity for the job such that it can run on any one of the two but nowhere else).

This is easy enough to do with operating system APIs (e.g. Win32) but I don't know enough about pthreads to tell whether it offers the required precision. As a first approximation you could match the number of threads to the suspected number of L1 caches.
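
For what it's worth, on Linux with glibc there is a non-portable extension, pthread_attr_setaffinity_np, that lets you start a thread already pinned to chosen CPUs. A minimal sketch; which CPU numbers actually have distinct L1 caches is machine-specific, so the {0, 2} mapping below is only an assumption:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    long cpu = *(long *)arg;
    printf("worker intended for CPU %ld\n", cpu);
    /* ... one sieving job would go here ... */
    return NULL;
}

int main(void)
{
    /* Assumption: CPUs 0 and 2 have distinct L1 caches on this machine.
       On Linux, the cache directories under /sys/devices/system/cpu/
       show the real topology. */
    long cpus[] = {0, 2};
    pthread_t tid[2];

    for (int i = 0; i < 2; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpus[i], &set);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* Non-portable glibc extension: the thread starts already pinned. */
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

        pthread_create(&tid[i], &attr, worker, &cpus[i]);
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    return 0;
}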

DarthGizka
  • Overall well written, with just two caveats: Firstly, tune down that angry tone of yours. Secondly, your last sentence is just FUD and your answer would be better off without it. I don't blame you for not knowing much about pthreads but then you shouldn't make such statements. – Ulrich Eckhardt May 27 '16 at 07:07
  • @Ulrich: don't see 'angry', I was only trying to emphasise the two important memes. Anyway, please feel free to edit away whatever you feel is inappropriate (English is not my native language, and it is a personal fault of mine that I tend to focus on the 'what', not the 'how' - i.e. the exact opposite of 'form over function'). – DarthGizka May 27 '16 at 07:14