
First, I'm sorry if this topic isn't well written (it's my first one). I'm currently trying to learn GPU computing on NVIDIA hardware, but I have a problem with the `__syncthreads()` function from CUDA; I think it doesn't work. I've tried searching the web and haven't found a fix.

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];   // Shared memory visible to all threads of the same block

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    /*for (int i = 0; i < N*sizeof(int); i++)
        printf("%i ---i=%i \n", in[i], i);*/

    // Move the input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS]; // Fetch the halo before the block's values
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; // Fetch the halo after the block's values
    }

    __syncthreads(); // Synchronize the threads of the same block

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    out[gindex] = result;
}

When I uncomment the for loop, the program works properly, but without it my program returns -842150451 in the out variable.

The main code :

int main()
{
    int size = N * sizeof(int);

    /******************** Using the GPU function stencil_1d ********************/

    int *in, *out; // Variables in Host memory
    int *d_in, *d_out;  // Variables in Device memory

    // Allocate memory for the variables on the Device
    cudaMalloc((void**)&d_in, size);
    cudaMalloc((void**)&d_out, size);

    // Allocate memory for the Host variables
    in = (int*)malloc(size); random_ints(in, N);
    out = (int*)malloc(size);

    // Copy the variable values from Host to Device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Run the function on the Device (here 3 blocks, 10 threads)
    stencil_1d <<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_in, d_out);

    // Copy the out variable from the Device back to the Host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Print the result
    for(int i=0; i<size; i++)
        printf("%i ---i=%i \n", out[i], i);

    // Free the memory used by the Host variables
    free(in); free(out);

    // Free the memory used by the variables on the Device
    cudaFree(d_in); cudaFree(d_out);

    return 0;
}

I forgot to include the defines:

#define N 30
#define THREADS_PER_BLOCK 10
#define BLOCK_SIZE (N/THREADS_PER_BLOCK)
#define RADIUS 3

The random_ints code:

void random_ints(int *var, int n) // Assigns a value to every element of the array
{
    int i;
    for (i = 0; i < n; i++)
        var[i] = 1;
}

Thanks in advance for your answers.

    It would be easier to read if you can translate the comment to English... – hkchengrex Aug 22 '18 at 15:03
  • 1
    please provide a complete code, we would like to know what `N`, `THREADS_PER_BLOCK`, `BLOCK_SIZE`, `RADIUS`, `random_ints` etc. are. Provide a code that someone else could compile,without having to add anything. – Robert Crovella Aug 22 '18 at 15:03
  • 1
    " I've a problem with the *whatever* method from *whatever_library*, i think it's doesn't work" - that's a common misunderstanding by beginners. Usually whatever library/API you use works just fine - you *can* of course be the first person to spot a problem, but that's a lot rarer than you simply not understanding how to use the library/API properly. The bug is *much more* likely to be in your code / your usage of the lib, than in the lib. – Jesper Juhl Aug 22 '18 at 15:06
  • 1
    This is not correct: `for(int i=0; i<size; i++)` – Robert Crovella Aug 22 '18 at 15:08
  • I just updated the post – Yoann clot Aug 22 '18 at 15:15
  • Thx Robert, but i do the change and its not the problem... – Yoann clot Aug 22 '18 at 15:17

1 Answer


This code was originally designed for teaching; it has some defects in it.

First, any time you are having trouble with CUDA code, I recommend adding proper CUDA error checking and running your code with cuda-memcheck (see my example use of cuda-memcheck below). If you do this before asking others for help, and provide the error info in your question, it will be easier for others to help you.

If you run this code with cuda-memcheck it will indicate errors both in access of global memory and shared memory.
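As a minimal sketch of what such error checking might look like (the `cudaCheck` macro name is my own illustration, not part of the CUDA API), each runtime call and kernel launch can be wrapped like this:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: checks the status returned by a CUDA runtime
// call and aborts with the file/line location if the call failed.
#define cudaCheck(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Example usage:
//   cudaCheck(cudaMalloc((void**)&d_in, size));
//   stencil_1d<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_in, d_out);
//   cudaCheck(cudaGetLastError());      // catches kernel launch errors
//   cudaCheck(cudaDeviceSynchronize()); // catches asynchronous execution errors
```

Kernel launches themselves return no status, which is why the sketch checks `cudaGetLastError()` and `cudaDeviceSynchronize()` right after the launch.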

  1. Your choice of BLOCK_SIZE is not correct. This should be set equal to THREADS_PER_BLOCK, not (N/THREADS_PER_BLOCK). It appears you intended to run this kernel with 3 blocks of 10 threads each, so we'll work with that.

  2. These lines will index out-of-bounds:

    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS]; // Fetch the halo before the block's values
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE]; // Fetch the halo after the block's values
    }
    

    for example, in the first block, for the first thread (`threadIdx.x` is 0), `gindex` is 0, so the calculation `gindex - RADIUS` evaluates to -3, which indexes `in` out of bounds. That can't be correct.

  3. This for-loop is incorrect:

    for(int i=0; i<size; i++)
    

    it should be:

    for(int i=0; i<N; i++)
    

When I fix those things, your code runs without error and produces a sensible result for me:

$ cat t280.cu
#define N 30
#define THREADS_PER_BLOCK 10
#define BLOCK_SIZE THREADS_PER_BLOCK
#define RADIUS 3

#include <stdio.h>

void random_ints(int *var, int n) // Assigns a value to every element of the array
{
    int i;
    for (i = 0; i < n; i++)
        var[i] = 1;
}

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];   // Shared memory visible to all threads of the same block

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    /*for (int i = 0; i < N*sizeof(int); i++)
        printf("%i ---i=%i \n", in[i], i);*/

    // Move the input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = (gindex >= RADIUS)?in[gindex - RADIUS]:0; // Fetch the halo before the block's values
        temp[lindex + BLOCK_SIZE] = ((gindex + BLOCK_SIZE)<N)?in[gindex + BLOCK_SIZE]:0; // Fetch the halo after the block's values
    }

    __syncthreads(); // Synchronize the threads of the same block

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    out[gindex] = result;
}

int main()
{
    int size = N * sizeof(int);

    /******************** Using the GPU function stencil_1d ********************/

    int *in, *out; // Variables in Host memory
    int *d_in, *d_out;  // Variables in Device memory

    // Allocate memory for the variables on the Device
    cudaMalloc((void**)&d_in, size);
    cudaMalloc((void**)&d_out, size);

    // Allocate memory for the Host variables
    in = (int*)malloc(size); random_ints(in, N);
    out = (int*)malloc(size);

    // Copy the variable values from Host to Device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Run the function on the Device (here 3 blocks, 10 threads)
    stencil_1d <<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_in, d_out);

    // Copy the out variable from the Device back to the Host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Print the result
    for(int i=0; i<N; i++)
        printf("%i ---i=%i \n", out[i], i);

    // Free the memory used by the Host variables
    free(in); free(out);

    // Free the memory used by the variables on the Device
    cudaFree(d_in); cudaFree(d_out);

    return 0;
}
$ nvcc -o t280 t280.cu
$ cuda-memcheck ./t280
========= CUDA-MEMCHECK
4 ---i=0
5 ---i=1
6 ---i=2
7 ---i=3
7 ---i=4
7 ---i=5
7 ---i=6
7 ---i=7
7 ---i=8
7 ---i=9
7 ---i=10
7 ---i=11
7 ---i=12
7 ---i=13
7 ---i=14
7 ---i=15
7 ---i=16
7 ---i=17
7 ---i=18
7 ---i=19
7 ---i=20
7 ---i=21
7 ---i=22
7 ---i=23
7 ---i=24
7 ---i=25
7 ---i=26
6 ---i=27
5 ---i=28
4 ---i=29
========= ERROR SUMMARY: 0 errors
$

The reason we get 4,5,6 at each end of the stencil output is due to the limitations we placed in the kernel for item 2 above, to avoid out-of-bounds indexing. You could change this boundary behavior, if you want.

One more comment: right now your code chooses N and THREADS_PER_BLOCK so that N is evenly divisible by THREADS_PER_BLOCK. As long as you do that (and adhere to other limits, such as the maximum of 1024 threads per block) you should be OK with this code. For full flexibility other changes would be needed; however, what I've described here should be enough to get you past these errors.
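As a sketch of one such change (my own illustration, not part of the original code): round the grid size up and guard the global index, so the kernel tolerates an N that is not a multiple of THREADS_PER_BLOCK. Note that the `__syncthreads()` barrier must still be reached by every thread in the block, so only the loads and the final store are guarded, never the barrier itself:

```cuda
// Illustration only: a guarded variant that tolerates N not being a
// multiple of THREADS_PER_BLOCK. Out-of-range positions are zero-filled.
__global__ void stencil_1d_guarded(int *in, int *out, int n) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Load the center element (zero-fill past the end of the array)
    temp[lindex] = (gindex < n) ? in[gindex] : 0;
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = (gindex >= RADIUS) ? in[gindex - RADIUS] : 0;
        temp[lindex + BLOCK_SIZE] =
            ((gindex + BLOCK_SIZE) < n) ? in[gindex + BLOCK_SIZE] : 0;
    }

    __syncthreads(); // every thread must reach this; no early return above it

    if (gindex < n) {
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];
        out[gindex] = result;
    }
}

// Launch with the grid rounded up:
//   int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
//   stencil_1d_guarded<<<blocks, THREADS_PER_BLOCK>>>(d_in, d_out, N);
```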

Robert Crovella