
I've written a piece of code to call the kernel in the book GPU Gems 3, Chapter 39: Parallel Prefix Sum (Scan) with CUDA.

However, the results I get are a bunch of negative numbers instead of the prefix sum.

Is my kernel call wrong or is there something wrong with the code from the GPU Gems 3 book?

Here is my code:

#include <stdio.h>
#include <sys/time.h>
#include <cuda.h>

__global__ void kernel(int *g_odata, int *g_idata, int n, int dim)
{
    extern __shared__ int temp[]; // allocated on invocation
    int thid = threadIdx.x;
    int offset = 1;

    temp[2*thid] = g_idata[2*thid]; // load input into shared memory
    temp[2*thid+1] = g_idata[2*thid+1];
    for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
    {
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            temp[bi] += g_idata[ai];
        }
        offset *= 2;
    }
    if (thid == 0) { temp[n - 1] = 0; } // clear the last element
    for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
    {
        offset >>= 1;
        __syncthreads();
        if (thid < d)
        {
            int ai = offset*(2*thid+1)-1;
            int bi = offset*(2*thid+2)-1;
            int t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    g_odata[2*thid] = temp[2*thid]; // write results to device memory
    g_odata[2*thid+1] = temp[2*thid+1];
}

void Initialize(int *h_in, int num_items)
{
    int j;
    for (j = 0; j < num_items; j++)
        h_in[j] = j;
    printf(" input: ");
    printf("\n\n");
}

int main(int argc, char** argv)
{
    int num_items = 512;

    int*  h_in = new int[num_items];

    // Initialize problem 
    Initialize(h_in, num_items);

    int *d_in = NULL;
    cudaMalloc((void**)&d_in, sizeof(int) * num_items);

    if(cudaSuccess != cudaMemcpy(d_in, h_in, sizeof(int) * num_items, cudaMemcpyHostToDevice)) fprintf(stderr,"could not copy to gpu");

    // Allocate device output array
    int *d_out = NULL;
    cudaMalloc((void**)&d_out, sizeof(int) * (num_items+1));

    kernel<<<1, 256, num_items*sizeof(int)>>>(d_out, d_in, num_items, 2);

    int* h_out= new int[num_items+1];
    if(cudaSuccess != cudaMemcpy(h_out,d_out,sizeof(int)*(num_items+1),cudaMemcpyDeviceToHost))fprintf(stderr,"could not copy back");
    int i;
    printf(" \n");
    for (i = 0; i < num_items; i++)
        printf(" ,%d ", h_out[i]);
    // Cleanup
    if (h_in) delete[] h_in;
    if (h_out) delete[] h_out;
    if (d_in) cudaFree(d_in);
    if (d_out) cudaFree(d_out);

    printf("\n\n");

    return 0;
}
  • Your kernel uses dynamic shared memory (you even have the comment "allocated on invocation"), but you don't allocate shared memory on kernel invocation; it should be the 3rd parameter of the execution configuration, see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#execution-configuration – Maxim Milakov Jun 14 '15 at 17:16
  • So I allocated shared memory on kernel invocation by calling kernel<<< 1, 256, 256*sizeof(int) >>>(g_odata, g_idata, n), but I still get some negative numbers as my results. Does anyone know what I'm doing wrong? – dibid Jun 14 '15 at 18:26
  • Please post the *actual* code you are now running into your question. There were a lot of serious errors in the version in the question you posted, but clearly that is not what you are running now.... – talonmies Jun 14 '15 at 18:41
  • @dibid: Yes, but the code you posted has a kernel invocation with the number of threads and blocks reversed and no shared memory allocation. That isn't what you are actually using, is it? And the GPU gems code didn't have 20 lines of commented code in it, did it? So all I am asking is that you show the exact code you now have, with the exact output and an exact description of the current problem. – talonmies Jun 14 '15 at 20:02
  • @talonmies I've edited my code, that is my exact code and here is my out put: ,-513 ,-1 ,-1 ,-67585 ,-33554465 ,-8705 ,-1 ,-134218241 ,-2049 ,-1 ,-1 ,-1 ,-1 ,-1 ,-1 ,... my current problem is that I don't know why this code is giving me such result. It should be prefix sum for 512 items. – dibid Jun 14 '15 at 20:14
  • You still are not launching the kernel correctly. Look at the indexing of the shared memory array in the kernel, and look at how much shared memory you are launching the kernel with. The kernel never runs because of an out-of-bounds memory access. – talonmies Jun 14 '15 at 20:17
  • @talonmies I see, so what exactly should it be? 2*(number of threads+1) ? – dibid Jun 14 '15 at 20:20
  • Probably, yes. I'm sure the GPU gems article explains it – talonmies Jun 14 '15 at 20:32
  • You really need to start using [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). – Robert Crovella Jun 14 '15 at 20:33
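
Putting the comment thread together: the dynamic shared-memory size belongs in the third slot of the execution configuration, and for this kernel (which loads two elements per thread) it must cover the whole data set. A minimal sketch using the question's variable names:

kernel<<<1, num_items/2, num_items*sizeof(int)>>>(d_out, d_in, num_items, 2); // 256 threads, 512 ints of shared memory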

1 Answer


It seems that you've made at least one error in transcribing the code from the GPU Gems 3 chapter into your kernel. This line is incorrect:

temp[bi] += g_idata[ai];

it should be:

temp[bi] += temp[ai];

When I make that one change to the code you have now posted, it seems to print out the correct (exclusive-scan) prefix sum for me. There are a few other things I would mention:

  1. Even without that change, I get some results that are close to correct. So if you're getting wildly different output (e.g. negative numbers), you may have a problem with your machine setup or CUDA install. I would suggest using more rigorous cuda error checking than what you have now (although a machine setup problem should have been indicated by one of your checks).

  2. The routine as crafted has some limitations: it can only be used in a single threadblock, it will have bank conflicts on shared memory access, and the data set size is limited to what a single threadblock can handle (the routine produces two output elements per thread, so the data set size is expected to be twice the number of threads). As has already been covered, the dynamic shared memory allocation needs to be as large as the data set size (i.e. twice the number of threads, in elements).

  3. This may be useful for learning, but if you want a robust, fast prefix scan, you are advised to use a routine from thrust or cub instead of your own code, even if derived from this (old) article; a minimal thrust sketch follows this list.
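
For reference, a hedged sketch of the thrust route (thrust ships with the CUDA toolkit; thrust::exclusive_scan computes the same exclusive prefix sum as the kernel above):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/scan.h>
#include <cstdio>

int main(){
  thrust::device_vector<int> d(512);
  thrust::sequence(d.begin(), d.end());                  // d[i] = i, like the question's Initialize
  thrust::exclusive_scan(d.begin(), d.end(), d.begin()); // in-place exclusive scan
  int last = d[511];                                     // 0+1+...+510 = 130305
  printf("last element: %d\n", last);
  return 0;
}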

The following code is similar to yours, but it has the above issues fixed, and I have templated the kernel for use with various datatypes:

#include <stdio.h>
#define DSIZE 512
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


typedef int mytype;

template <typename T>
__global__ void prescan(T *g_odata, T *g_idata, int n)
{
  extern __shared__ T temp[];  // allocated on invocation
  int thid = threadIdx.x;
  int offset = 1;
  temp[2*thid] = g_idata[2*thid]; // load input into shared memory
  temp[2*thid+1] = g_idata[2*thid+1];
  for (int d = n>>1; d > 0; d >>= 1)                    // build sum in place up the tree
  {
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }
  if (thid == 0) { temp[n - 1] = 0; } // clear the last element
  for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      T t = temp[ai];
      temp[ai] = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();
  g_odata[2*thid] = temp[2*thid]; // write results to device memory
  g_odata[2*thid+1] = temp[2*thid+1];
}

int main(){

  mytype *h_i, *d_i, *h_o, *d_o;
  int dszp = (DSIZE)*sizeof(mytype);

  h_i = (mytype *)malloc(dszp);
  h_o = (mytype *)malloc(dszp);
  if ((h_i == NULL) || (h_o == NULL)) {printf("malloc fail\n"); return 1;}
  cudaMalloc(&d_i, dszp);
  cudaMalloc(&d_o, dszp);
  cudaCheckErrors("cudaMalloc fail");
  for (int i = 0 ; i < DSIZE; i++){
    h_i[i] = i;
    h_o[i] = 0;}
  cudaMemset(d_o, 0, dszp);
  cudaCheckErrors("cudaMemset fail");
  cudaMemcpy(d_i, h_i, dszp, cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy 1 fail");
  prescan<<<1,DSIZE/2, dszp>>>(d_o, d_i, DSIZE);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel fail");
  cudaMemcpy(h_o, d_o, dszp, cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy 2 fail");
  mytype psum = 0;
  for (int i =1; i < DSIZE; i++){
    psum += h_i[i-1];
    if (psum != h_o[i]) {printf("mismatch at %d, was: %d, should be: %d\n", i, h_o[i], psum); return 1;}
    }
  return 0;
}
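
A typical build-and-run line for the above (the file name is hypothetical; see the comments below regarding the architecture flag):

nvcc -o scan scan.cu
./scan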
Robert Crovella
  • Invalid device function usually means that you are compiling for an incorrect architecture. What GPU are you using, what Operating System and version are you using, and what is the exact compile command you are using? (This may be an explanation for point 1 in my answer, since my code does proper cuda error checking.) – Robert Crovella Jun 15 '15 at 15:23
  • Sorry, "Tesla" isn't a sufficient description. I need to know the exact GPU you are using. Run the deviceQuery sample code if you're not sure. – Robert Crovella Jun 15 '15 at 15:40
  • You could also probably just omit the `-arch=sm_35` switch from your compile command, and this code should work on any GPU you have, assuming it is not [a very old one that is not supported by CUDA 7](http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#unsupported-features). – Robert Crovella Jun 15 '15 at 15:53
  • I used >lspci | grep VGA and it gave me: NVIDIA Corporation GF100GL [Quadro 5000] (rev a3) 09:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. G200eR2 (rev 01) 83:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 680] (rev a1) 84:00.0 VGA compatible controller: NVIDIA Corporation GF100GL [Tesla C2050 / C2070] (rev a3) deviceQuery did not work – dibid Jun 15 '15 at 16:14
  • Yes it did! Thank you very much :) Any idea why I should omit `-arch=sm_35`? – dibid Jun 15 '15 at 16:18
  • None of your GPUs are cc 3.5 GPUs, so `sm_35` is not the correct architecture to specify. The Quadro 5000 is a cc 2.0 GPU, the GTX 680 is a cc 3.0 GPU, and the Tesla C2050/C2070 is a cc 2.0 GPU. deviceQuery should work, however. All of those GPUs are supported by CUDA 7. – Robert Crovella Jun 15 '15 at 16:37
  • @RobertCrovella I was wondering why this code can only be used in a single threadblock? – dibid Jun 19 '15 at 05:08
  • It's fairly evident, if you read the gpugems chapter, and consider lines of code such as this one: `temp[2*thid] = g_idata[2*thid]; // load input into shared memory` that it is only designed/intended to be used in a single threadblock. Ask yourself how `thid` will work if you run with 2 threadblocks for example. That's just one example. I don't know all the issues with this code that would prevent it from use in multiple threadblocks. – Robert Crovella Jun 19 '15 at 05:15
  • @RobertCrovella I just have one last question, Why do we use n/2 threads per block for loading data into shared memory? – dibid Jun 26 '15 at 18:26
  • The initial step of the scan requires each (active) thread to look at two data elements from the input. Therefore there is a 2:1 relationship between input data and active threads. In order to keep each thread (initially) active, each thread loads 2 input elements. – Robert Crovella Jun 26 '15 at 18:33
  • Is there any reason that your sample code works only for DSIZE values that are a multiple of 32? Especially since the sample code just uses a single threadblock (https://stackoverflow.com/a/26611959/4647107). – Pratyush Das Feb 04 '20 at 19:20
  • The sample code is designed to demonstrate a concept. It should be possible to get it to work with more-or-less arbitrary data sizes, but it would certainly require some rewriting. There are a number of things going on in the kernel that expect a power-of-2 sweep operation, and therefore shared memory sized appropriately, even if the data size is not a power-of-2. That is one example of how it would have to be refactored. – Robert Crovella Feb 04 '20 at 19:30
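
To illustrate that last comment, a hedged sketch of one way to handle a non-power-of-2 data size, reusing the names from the answer's listing (prescan, d_i, d_o, h_i, mytype): pad the input with zeros (the identity for addition) up to the next power of 2, then run the power-of-2 sweep over the padded size.

int n = 500;                         // desired data size, not a power of 2
int n2 = 1;
while (n2 < n) n2 <<= 1;             // round up to the next power of 2 (512 here)
cudaMemset(d_i, 0, n2*sizeof(mytype));                          // zero-fill the padding
cudaMemcpy(d_i, h_i, n*sizeof(mytype), cudaMemcpyHostToDevice); // first n real values
prescan<<<1, n2/2, n2*sizeof(mytype)>>>(d_o, d_i, n2);          // power-of-2 sweep
// only the first n elements of d_o are meaningful afterwards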