
I am very new to CUDA and I am trying to initialize an array on the device and copy the result back to the host to print out, to show whether it was correctly initialized. I am doing this because the end goal is a dot product solution in which I multiply two arrays together, store the results in another array, and then sum up the entire thing so that I only need to return one value to the host.

In the code I am working on, I am only trying to see whether I am initializing the array correctly. I am trying to create an array of size N following the pattern 1,2,3,4,5,6,7,8,1,2,3,...

This is the code I've written. It compiles without issue, but when I run it the terminal hangs and I have no clue why. Could someone help me out here? I'm so incredibly confused :\

#include <stdio.h>
#include <stdlib.h>
#include <chrono>

#define ARRAY_SIZE 100
#define BLOCK_SIZE 32

__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int temp; 

    if(temp != 8){
        a_d[x] = temp;
        temp++;
    } else {
        a_d[x] = temp;
        temp = 1;
    }


}

int main (int argc, char *argv[])
{

//declare pointers for arrays
int *a_d, *b_d, *c_d, *sum_h, *sum_d,a_h[ARRAY_SIZE];

//set space for device variables 
cudaMalloc((void**) &a_d, sizeof(int) * ARRAY_SIZE); 
cudaMalloc((void**) &b_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &c_d, sizeof(int) * ARRAY_SIZE);
cudaMalloc((void**) &sum_d, sizeof(int)); 


    // set execution configuration
        dim3 dimblock (BLOCK_SIZE);
        dim3 dimgrid (ARRAY_SIZE/BLOCK_SIZE);

    // actual computation: call the kernel
        cu_kernel <<<dimgrid, dimblock>>> (a_d,b_d,c_d,ARRAY_SIZE);
    
        cudaError_t result;

   // transfer results back to host
        result = cudaMemcpy (a_h, a_d, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);
        if (result != cudaSuccess) {
            fprintf(stderr, "cudaMemcpy failed.");
            exit(1);
        }

    // print array
        printf ("Final state of the array:\n");
        for (int i =0; i < ARRAY_SIZE; i++) {
            printf ("%d ", a_h[i]);
        }
        printf ("\n");

}
    There isn't any reason your code should hang. I think you have a system/setup problem. The code you have written will not produce a 1,2,3,4,5,6,7,8,1,2,... pattern as you are suggesting, but that is separate from any sort of hang. You may wish to run your code with `compute-sanitizer` i.e. `compute-sanitizer ./my_code` or else add [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) either of which may help to uncover any system or CUDA install problems. – Robert Crovella Oct 28 '22 at 17:57
  • You were right. After leaving it and coming back it ran without issue. Do you have any suggestions on code to create the pattern? I made some adjustments, including adding a statement to set the value of temp to 1 (set by thread 0). It worked and the value of a_h[0] returned as 1 but every other element of the array was 0. I cannot figure out why since theoretically all the threads should be entering the if statement and setting a_d[x] to something at the very least. Not sure how it is getting 0 :\ – JustBeginning Oct 28 '22 at 18:43

1 Answer


There are at least 3 issues with your kernel code.

  • you are using the shared memory variable temp without initializing it.
  • you are not resolving the order in which threads access a shared variable as discussed here.
  • you are imagining (perhaps) a particular order of thread execution, and CUDA provides no guarantees in that area

The first item seems self-evident; however, naive methods to initialize it in a multi-threaded environment like CUDA are not going to work. First, there is the multi-threaded access pattern again. Furthermore, in a multi-block scenario, shared memory in one block is logically distinct from shared memory in another block.

Rather than wrestle with mechanisms unsuited to the task (informed by notions carried over from a serial processing environment), I would simply do something trivial like this to create the pattern you desire:

__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < size) a_d[x] = (x&7) + 1;  
}

Are there other ways to do it? Certainly.

__global__ void cu_kernel (int *a_d,int *b_d,int *c_d, int size)
{

    int x = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int temp; 
    if (!threadIdx.x) temp = blockIdx.x*blockDim.x;
    __syncthreads();
    if (x < size) a_d[x] = ((temp+threadIdx.x) & 7) + 1;

}

You can get as fancy as you like.

These changes will still leave a few values at zero at the end of the array, because your grid sizing (ARRAY_SIZE/BLOCK_SIZE = 100/32 = 3 blocks) launches only 96 threads for 100 elements; fixing that requires changes to your grid sizing. There are many questions about this already, or you could study a sample code like vectorAdd.

Robert Crovella