
I'm writing an implementation of the Sieve of Eratosthenes (https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) on the GPU, but nothing like this one: http://developer-resource.blogspot.com/2008/07/cuda-sieve-of-eratosthenes.html

Method:

  1. Create an n-element array with default values 0/1 (0 = prime, 1 = not prime) and copy it to the GPU (I know this could be done directly in the kernel, but that's not the problem at the moment).
  2. Each thread in a block checks multiples of a single number. Each block checks sqrt(n) possibilities in total. Each block == a different interval.
  3. Mark multiples as 1 and copy the data back to the host.

Code:

#include <stdio.h>
#include <stdlib.h>
#define THREADS 1024

__global__ void kernel(int *global, int threads) {
    extern __shared__ int cache[];

    int tid = threadIdx.x + 1;
    int offset = blockIdx.x * blockDim.x;
    int number = offset + tid;

    cache[tid - 1] = global[number];
    __syncthreads();

    int start = offset + 1;
    int end = offset + threads;

    for (int i = start; i <= end; i++) {
        if ((i != tid) && (tid != 1) && (i % tid == 0)) {
            cache[i - offset - 1] = 1;
        }
    }
    __syncthreads();
    global[number] = cache[tid - 1];
}


int main(int argc, char *argv[]) {
    int *array, *dev_array;
    int n = atol(argv[1]);
    int n_sqrt = floor(sqrt((double)n));

    size_t array_size = n * sizeof(int);
    array = (int*) malloc(n * sizeof(int));
    array[0] = 1;
    array[1] = 1;
    for (int i = 2; i < n; i++) {
        array[i] = 0;
    }

    cudaMalloc((void**)&dev_array, array_size);
    cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);

    int threads = min(n_sqrt, THREADS);
    int blocks = n / threads;
    int shared = threads * sizeof(int);
    kernel<<<blocks, threads, shared>>>(dev_array, threads);
    cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);

    int count = 0;
    for (int i = 0; i < n; i++) {
        if (array[i] == 0) {
            count++;
        }
    }
    printf("Count: %d\n", count);
    return 0;
}


Run: ./sieve 10240000

It works correctly for n = 16, 64, 1024, 102400... but for n = 10240000 I get an incorrect result. Where is the problem?

Bakus123
    Any time you are having trouble with a CUDA code, it's a good idea to use [proper cuda error checking](http://stackoverflow.com/questions/14038589) and also run your code with `cuda-memcheck`. You **really** should do that **before** posting here. When I run your code with n=10240000 cuda-memcheck reports an invalid global read of size 4 from the kernel. If you use that information plus the method described [here](http://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218?s=1|0.3141#27278218) you may be able to pinpoint the problem. – Robert Crovella Jun 22 '15 at 19:45

2 Answers


There are a couple of issues, I think, but here's a pointer to the actual problem: the Sieve of Eratosthenes iteratively removes multiples of already encountered prime numbers, and you want to split the work-load into thread-blocks, where each thread-block operates on a piece of shared memory (cache, in your example). Thread-blocks, however, are generally independent of all other thread-blocks and cannot easily communicate with one another. One example to illustrate the problem: the thread with index 0 in the thread-block with index 0 removes multiples of 2. Thread-blocks with index > 0 have no way to know about this.
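One way around this lack of inter-block communication (also suggested in the comments below) is a segmented sieve: collect the base primes up to sqrt(n) on the host first, then let every block cross off multiples of those base primes inside its own interval, so no block ever needs a result produced by another block. What follows is only a rough, untested sketch of that idea - the kernel name, segment size, launch configuration and helper names are assumptions of this sketch, not code from the question:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// One block per segment of the array; each thread strides over the base
// primes and crosses off their multiples inside this block's interval
// [seg_lo, seg_lo + seg_len).
__global__ void segmented_sieve(char *composite, const int *base_primes,
                                int n_base, int seg_len, int n) {
    int seg_lo = blockIdx.x * seg_len;
    for (int t = threadIdx.x; t < n_base; t += blockDim.x) {
        int p = base_primes[t];
        // first multiple of p inside the segment, but never below p*p
        long long start = ((long long)(seg_lo + p - 1) / p) * p;
        if (start < (long long)p * p) start = (long long)p * p;
        for (long long m = start; m < seg_lo + seg_len && m < n; m += p)
            composite[m] = 1;
    }
}

int main(int argc, char *argv[]) {
    int n = (argc > 1) ? atoi(argv[1]) : 10240000;
    int limit = (int)floor(sqrt((double)n));

    // plain host-side sieve of [2, limit] to collect the base primes
    char *small_sieve = (char *)calloc(limit + 1, 1);
    int *base = (int *)malloc((limit + 1) * sizeof(int));
    int n_base = 0;
    for (int i = 2; i <= limit; i++)
        if (!small_sieve[i]) {
            base[n_base++] = i;
            for (int j = i * i; j <= limit; j += i) small_sieve[j] = 1;
        }

    char *dev_comp; int *dev_base;
    cudaMalloc((void**)&dev_comp, n);
    cudaMemset(dev_comp, 0, n);
    cudaMalloc((void**)&dev_base, n_base * sizeof(int));
    cudaMemcpy(dev_base, base, n_base * sizeof(int), cudaMemcpyHostToDevice);

    int seg_len = 1024;                        // segment size, chosen arbitrarily
    int blocks = (n + seg_len - 1) / seg_len;  // enough blocks to cover the array
    segmented_sieve<<<blocks, 256>>>(dev_comp, dev_base, n_base, seg_len, n);

    char *comp = (char *)malloc(n);
    cudaMemcpy(comp, dev_comp, n, cudaMemcpyDeviceToHost);
    int count = 0;
    for (int i = 2; i < n; i++)
        if (!comp[i]) count++;
    printf("Count: %d\n", count);
    return 0;
}

The key point is that each block only reads the precomputed base primes and writes into its own segment, so the blocks never have to exchange information.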

  • would marking the multiples of odds, instead of primes, help (or 2-3-5... coprimes, i.e. wheel)? – Will Ness Jun 21 '15 at 10:53
  • another way to deal with that might be to work in segments, and use the already known primes to sieve the next block. – Will Ness Jun 21 '15 at 10:57
  • @Peter Klenner, thx, but look at my edited post. I added an example of what I want to achieve. In my opinion communication between thread-blocks isn't needed. Each thread-block works on a different interval. – Bakus123 Jun 21 '15 at 11:19
  • @Will Ness, I don't mark primes, only multiples. When we have n = 64 there are sqrt(64) = 8 numbers to check. And I want each thread-block to work on a different interval, with threadIdx == multiple. – Bakus123 Jun 21 '15 at 11:28
  • block 1 covers the interval 10 - 17 actually. Your loop `for (int i = start; i < end; i += tid) { cache[i - offset] = 1; }` is not suited to detect multiples of primes. For example, number 10 is discounted simply because it's the first entry in this thread-block. You would need the modulo operator and extend the loop over the whole segment for each thread. Off the top of my head, like this: `for (int i = start; i < end; i++) { if(i%tid==0){ cache[i - offset] = 1;} }` – Peter Klenner Jun 22 '15 at 07:46

This code has a variety of problems, in my view.

  1. You are fundamentally accessing items out of range. Consider this sequence in your kernel:

    int tid = threadIdx.x + 1;
    int offset = blockIdx.x * blockDim.x;
    int number = offset + tid;
    
    cache[tid - 1] = global[number];
    

    You (in some cases -- see below) have launched a thread array exactly equal in size to your global array. So what happens when the highest numbered thread runs the above code? number = threadIdx.x+1+blockIdx.x*blockDim.x. This number index will be one beyond the end of your array. This is true for many possible values of n. This problem would have been evident to you if you had either used proper cuda error checking or had run your code with cuda-memcheck. You should always do those things when you are having trouble with a CUDA code and also before asking for help from others.

  2. The code only has a chance of working correctly if the input n is a perfect square. The reason for this is contained in these lines of code (as well as dependencies in the kernel):

    int n = atol(argv[1]);
    int n_sqrt = floor(sqrt((double)n));
    ...
    int threads = min(n_sqrt, THREADS);
    int blocks = n / threads;
    

    (note that the correct function here would be atoi not atol, but I digress...) Unless n is a perfect square, the resultant n_sqrt will be somewhat less than the actual square root of n. This will lead you to compute a total thread array that is smaller than the necessary size. (It's OK if you don't believe me at this point. Run the code I will post below and input a size like 1025, then see if the number of threads * blocks is of sufficient size to cover an array of 1025.)

  3. As you've stated:

    Each block checks sqrt(n) possibilities in total.

    Hopefully this also points out the danger of a non-perfect-square n, but we must now ask: what if n is larger than the square of the largest threadblock size (1024)? The answer is that the code will not work correctly in many cases - and your chosen input of 10240000, although a perfect square, exceeds 1024^2 (1048576) and does not work for this reason. Your algorithm (which I claim is not a Sieve of Eratosthenes) requires that each block be able to check sqrt(n) possibilities, just as you stated in the question. When that is no longer possible because of the limit on threads per block, your algorithm starts to break; the small launch-arithmetic check after this list makes both of these failure modes concrete.
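To make points 2 and 3 concrete, the launch arithmetic can be replayed on the host alone. This is a small standalone sketch (separate from the fixed code further below) that recomputes threads and blocks exactly as the question does, then reports how much of the array the resulting grid covers and how many divisor candidates each block can actually check:

#include <stdio.h>
#include <math.h>

int main(void) {
    int sizes[] = {1024, 1025, 10240000};   // perfect square, non-square, and n > 1024^2
    for (int k = 0; k < 3; k++) {
        int n = sizes[k];
        int n_sqrt = (int)floor(sqrt((double)n));
        int threads = n_sqrt < 1024 ? n_sqrt : 1024;   // min(n_sqrt, THREADS)
        int blocks  = n / threads;                     // integer division, as in the question
        printf("n=%9d  threads=%4d  blocks=%6d  elements covered=%9d of %d"
               "  divisors checked per block=%d of %d\n",
               n, threads, blocks, threads * blocks, n, threads, n_sqrt);
    }
    return 0;
}

For 1025 the grid covers only 1024 of the 1025 elements (point 2), and for 10240000 each block can check only 1024 of the 3200 required divisors (point 3).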

Here is a code that makes some attempt to fix issue #1 above, and at least gives an explanation for the failures associated with #2 and #3:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>   // for memset in cpu_sieve
#include <math.h>     // for sqrt and floor
#define THREADS 1024
#define MAX 10240000

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


__global__ void kernel(int *global, int threads) {
    extern __shared__ int cache[];

    int tid = threadIdx.x + 1;
    int offset = blockIdx.x * blockDim.x;
    int number = offset + tid;

    if ((blockIdx.x != (gridDim.x-1)) || (threadIdx.x != (blockDim.x-1))){
      cache[tid - 1] = global[number];
      __syncthreads();

      int start = offset + 1;
      int end = offset + threads;

      for (int i = start; i <= end; i++) {
        if ((i != tid) && (tid != 1) && (i % tid == 0)) {
            cache[i - offset - 1] = 1;
        }
      }
      __syncthreads();
      global[number] = cache[tid - 1];}
}


int cpu_sieve(int n){
    int limit = floor(sqrt((double)n));
    int *test_arr = (int *)malloc(n*sizeof(int));
    if (test_arr == NULL) return -1;
    memset(test_arr, 0, n*sizeof(int));
    for (int i = 2; i <= limit; i++)   // include limit itself as a sieving value
      if (!test_arr[i]){
        int j = i*i;
        while (j < n){                 // stay inside the n-element array
          test_arr[j] = 1;
          j += i;}}
    int count = 0;
    for (int i = 2; i < n; i++)
      if (!test_arr[i]) count++;
    free(test_arr);
    return count;
}

int main(int argc, char *argv[]) {
    int *array, *dev_array;
    if (argc != 2) {printf("must supply n as command line parameter\n"); return 1;}
    int n = atoi(argv[1]);
    if ((n < 1) || (n > MAX)) {printf("n out of range %d\n", n); return 1;}
    int n_sqrt = floor(sqrt((double)n));

    size_t array_size = n * sizeof(int);
    array = (int*) malloc(n * sizeof(int));
    array[0] = 1;
    array[1] = 1;
    for (int i = 2; i < n; i++) {
        array[i] = 0;
    }

    cudaMalloc((void**)&dev_array, array_size);
    cudaMemcpy(dev_array, array, array_size, cudaMemcpyHostToDevice);

    int threads = min(n_sqrt, THREADS);
    int blocks = n / threads;
    int shared = threads * sizeof(int);
    printf("threads = %d, blocks = %d\n", threads, blocks);
    kernel<<<blocks, threads, shared>>>(dev_array, threads);
    cudaMemcpy(array, dev_array, array_size, cudaMemcpyDeviceToHost);
    cudaCheckErrors("some error");
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (array[i] == 0) {
            count++;
        }
    }
    printf("Count: %d\n", count);
    printf("CPU Sieve: %d\n", cpu_sieve(n));
    return 0;
}
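For reference, assuming the file is saved as sieve.cu (the compile line below is an assumption, not part of the original answer):

Compile: nvcc -o sieve sieve.cu

Run: cuda-memcheck ./sieve 1048576

Run: ./sieve 1025

1048576 is a perfect square whose root (1024) still fits in a single thread block, so the launch arithmetic in points 2 and 3 is satisfied; for 1025 the printed threads = 32, blocks = 32 line corresponds to a grid of only 1024 threads, which is exactly the coverage gap described in point 2.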
Robert Crovella
  • thanks a lot for all the comments. I have a question about the first issue. Is the if statement you added needed? I tried to avoid accessing items out of range by changing `global[number]` to `global[number - 1]`, but after that change the results are incorrect. What is the problem? – Bakus123 Jun 24 '15 at 20:21
  • `number-1` will offset all your results by 1. I wouldn't expect that to "work". If you run the code I provided as-is with `cuda-memcheck`, it will produce no errors. If you delete the if statement, and run it with cuda-memcheck, it will indicate errors. So I would say it is needed. – Robert Crovella Jun 24 '15 at 20:38
  • thanks for the clarification. I changed it to `global[number - 1]` in the kernel and `if (array[i - 1] == 0)` on the host, and it also works. – Bakus123 Jun 24 '15 at 20:55