CUDA Constant Memory Error

Question

I am trying to do a sample code with constant memory with CUDA 5.5. I have 2 constant arrays of size 3000 each. I have another global array X of size N. I want to compute

Y[tid] = X[tid]*A[tid%3000] + B[tid%3000]

Here is the code.

#include <iostream>
#include <stdio.h>
using namespace std;

#include <cuda.h>



__device__ __constant__ int A[3000];
__device__ __constant__ int B[3000];


__global__ void kernel( int *dc_A, int *dc_B, int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = dc_A[tid%3000]*X[tid] + dc_B[tid%3000];
    }

}

int main()
{
    int N=100000;

    // set affine constants on host
    int *h_A, *h_B ; //host vectors
    h_A = (int*) malloc( 3000*sizeof(int) );
    h_B = (int*) malloc( 3000*sizeof(int) );
    for( int i=0 ; i<3000 ; i++ )
    {
        h_A[i] = (int) (drand48() * 10);
        h_B[i] = (int) (drand48() * 10);
    }

    //set X and Y on host
    int * h_X = (int*) malloc( N*sizeof(int) );
    int * h_out = (int *) malloc( N*sizeof(int) );
    //set the vector
    for( int i=0 ; i<N ; i++ )
    {
        h_X[i] = i;
        h_out[i] = 0;
    }

    // copy, A,B,X,Y to device
    int * d_X, *d_out;
    cudaMemcpyToSymbol( A, h_A, 3000 * sizeof(int) ) ;
    cudaMemcpyToSymbol( B, h_B, 3000 * sizeof(int) ) ;

    cudaMalloc( (void**)&d_X, N*sizeof(int) ) );
    cudaMemcpy( d_X, h_X, N*sizeof(int), cudaMemcpyHostToDevice ) ;
    cudaMalloc( (void**)&d_out, N*sizeof(int) ) ;



    //call kernel for vector addition
    kernel<<< (N+1024)/1024,1024 >>>(A,B, d_X, d_out, N);
    cudaPeekAtLastError() ;
    cudaDeviceSynchronize() ;


    // D --> H
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost ) ;


    free(h_A);
    free(h_B);


    return 0;
}

I am trying to run the debugger over this code to analyze. Turns out that on the line which copies to constant memory I get the following error with debugger

Coalescing of the CUDA commands output is off.
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff5c5b700 (LWP 31200)]

Can somebody please help me out with constant memory

The output from the debugger you have shown isn't an error. That is absolutely normal output when a new debugger run is started. — talonmies, Oct 07 '13 at 05:14

score 4 · Accepted Answer · answered Oct 07 '13 at 06:11

There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:

__global__ void kernel(int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}

ie. don't try passing A and B to the kernel. The reasons are as follows:

Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel- If you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime.
Even if you did call cudaGetSymbolAddress and retrieve the symbols device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing do would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions

Once you get this working, you will find it is terribly slow and if you profile it you will find that there is very high degrees of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that - every thread is accessing a different value. Regular global memory will be faster in such a use case. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.

Thanks a lot, that worked. Also thanks for sharing that insight on performance. Actually, I am trying to learn to use the constant memory. But will remember your tip for production codes — mkuse, Oct 08 '13 at 05:57

CUDA Constant Memory Error

1 Answers1

Linked