4

I am writing my first program in CUDA. It is a prime number generator. It works, but it is only 50% faster than the equivalent single threaded C++ code. The CPU version uses 100% of one core. The GPU version uses only 20% of the GPU. The CPU is an i5 (2310). The GPU is a GF104.

How can I improve the performance of this algorithm?

My complete program follows.

int* d_C;

using namespace std;

__global__ void primo(int* C, int N, int multi)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < N) 
  {
    if(i%2==0||i%3==0||i%5==0||i%7==0)
    {
      C[i]=0;           
    }
    else
    {
      C[i]=i+N*multi;
    }
  }
}

int main()
{
  cout<<"Prime numbers \n";
  int N=1000;
  int h_C[1000];
  size_t size=N* sizeof(int);
  cudaMalloc((void**)&d_C, size);

  int threadsPerBlock = 1024;
  int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
  vector<int> lista(100000000);
  int c_z=0;

  for(int i=0;i<100000;i++)
  {
    primo<<<blocksPerGrid, threadsPerBlock>>>(d_C, N,i);    
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);         
    for(int c=0;c<N;c++)
    {   
      if(h_C[c]!=0)
      {
        lista[c+N*i-c_z]=h_C[c];
      }
      else
      {
        c_z++;
      }
    }   
  }
  lista.resize(lista.size()-c_z+1);
  return(0);
}

I tried using a 2D array and a for loop in the kernel, but was unable to get the correct results.

Roger Dahl
  • 15,132
  • 8
  • 62
  • 82
zeeck
  • 53
  • 4

1 Answers1

3

Welcome to Stack Overflow.

Here are some potential issues:

  • N = 1000 is too low. Since you have 1024 threadsPerBlock, your kernel will run only one block, which is not enough to utilize the GPU. Try N = 1000000, so that your kernel launch is for nearly 1000 blocks.

  • You're doing very little work on the GPU (4 modulus operations per number tested). So it's probably faster to do those operations on the CPU than it is to copy them from the GPU (over the PCIe bus).

To make it worthwhile to use the GPU for finding the prime numbers, I think you need to implement the entire algorithm on the GPU, instead of just the modulus operations.

Roger Dahl
  • 15,132
  • 8
  • 62
  • 82