
I'm having difficulty accessing a global array from a function that's executed on the device:

float globTemp[3][3] = { /* some values in here */ };
__device__ float* globTemp_d;

__global__ void compute(int *a, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = y*w+x;
  if(x<3 && y<3)
    a[i] = 1+globTemp_d[i];
}

int hostFunc(){
   float *a_d;
   cudaMalloc((void**)&a_d, 3*3*sizeof(int));
   cudaMalloc((void**)&globTemp_d, 3*3*sizeof(int));
   cudaMemcpy(globTemp_d,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
   compute<<<1,1>>>(a_d,3);
   cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);
}

However, I get a seg fault when I try to access `globTemp_d[i]`. Am I doing something wrong here?

Jo Skorsev
  • Is globTemp_d ever initialised? –  Jun 05 '13 at 01:10
  • @Ldrumm I think the `cudaMalloc` initializes `globTemp_d`, right? – Jo Skorsev Jun 05 '13 at 01:13
  • Just checked that compiler attribute; looks like I'm getting tired, and probably shouldn't be questioning CUDA. –  Jun 05 '13 at 01:24
    No, this code is broken in a variety of ways. If you do [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) you'll discover that some of your cuda API calls are failing. Then if you [read up](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-variable-qualifier) on `__device__` variables you'll see that you don't use `cudaMalloc` and `cudaMemcpy`. It's possible to do something like what you have shown, but why not use ordinary host-based pointers with `cudaMalloc` and `cudaMemcpy`? – Robert Crovella Jun 05 '13 at 01:28
  • Yes, you aren't using `cudaSafeCall` – Mikhail Jun 05 '13 at 02:02

1 Answer


There are a variety of problems with your code:

  1. Your grid is a 1D grid of 1D threadblocks (in fact, you are launching a single block of a single thread), but your kernel is written as if it expected a 2D threadblock structure (it uses the `.x` and `.y` built-in variables). A single thread certainly won't cover all nine elements, and a 1D threadblock won't satisfy your kernel's 2D indexing.
  2. `__device__` variables are not accessed with `cudaMalloc` and `cudaMemcpy`; they require a different set of API calls, such as `cudaMemcpyToSymbol`.
  3. You're not doing any cuda error checking, which is always recommended when you're having difficulty. You should check for errors on both API calls and kernel launches.
  4. You're mixing `float` data (`a_d`) with an `int` kernel parameter (`int *a`), so I don't think this code would compile without at least a warning, and ignoring that warning can of course lead to strange behavior.
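Point 3, for example, can be handled with a small wrapper macro. This is only a sketch (the macro name `cudaCheck` is my own; the canonical error-checking answer linked in the comments gives a fuller treatment):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define cudaCheck(call)                                           \
  do {                                                            \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",             \
              cudaGetErrorString(err), __FILE__, __LINE__);       \
      exit(EXIT_FAILURE);                                         \
    }                                                             \
  } while (0)

// Usage:
//   cudaCheck(cudaMalloc((void**)&a_d, 3*3*sizeof(float)));
//   compute<<<blocks,threads>>>(a_d, 3);
//   cudaCheck(cudaGetLastError());        // catches launch configuration errors
//   cudaCheck(cudaDeviceSynchronize());   // catches asynchronous execution errors
```

Note that kernel launches themselves return no status, which is why the `cudaGetLastError`/`cudaDeviceSynchronize` pair after the launch is needed.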

This is the closest I could come to your code while fixing all the errors:

#include <stdio.h>

__device__ float* globTemp_d;

__global__ void compute(float *a, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = (y*w)+x;
  if((x<3) && (y<3))
    a[i] = 1.0f+globTemp_d[i];
}

int main(){
   float *a_d, *d_globTemp;
   float globTemp[3][3] = {{0.1f, 0.2f, 0.3f}, {0.4f, 0.5f, 0.6f}, {0.7f, 0.8f, 0.9f}};
   float a[(3*3)];
   dim3 threads(3,3);
   dim3 blocks(1);
   cudaMalloc((void**)&a_d, 3*3*sizeof(float));
   cudaMalloc((void**)&d_globTemp, 3*3*sizeof(float));
   cudaMemcpy(d_globTemp,globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
   cudaMemcpyToSymbol(globTemp_d, &d_globTemp, sizeof(float *));
   compute<<<blocks,threads>>>(a_d,3);
   cudaMemcpy(a,a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);

   printf("results:\n");
   for (int i = 0; i<(3*3); i++)
     printf("a[%d] = %f\n", i, a[i]);
   return 0;
}

This code could be simplified by dispensing with the `__device__` variable and just passing `d_globTemp` as a parameter to the kernel, using it in place of the references to `globTemp_d`, but I did not make that simplification.
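For reference, that simplified version might look like the following sketch (same logic, but the device buffer is an ordinary kernel argument, so no `cudaMemcpyToSymbol` is needed):

```cuda
#include <stdio.h>

// The extra "temp" parameter replaces the __device__ pointer globTemp_d.
__global__ void compute(float *a, const float *temp, int w)
{
  int x = threadIdx.x + blockDim.x * blockIdx.x;
  int y = threadIdx.y + blockDim.y * blockIdx.y;
  int i = (y*w)+x;
  if ((x<3) && (y<3))
    a[i] = 1.0f + temp[i];
}

int main(){
   float *a_d, *d_globTemp;
   float globTemp[3][3] = {{0.1f, 0.2f, 0.3f}, {0.4f, 0.5f, 0.6f}, {0.7f, 0.8f, 0.9f}};
   float a[3*3];
   dim3 threads(3,3);
   dim3 blocks(1);
   cudaMalloc((void**)&a_d, 3*3*sizeof(float));
   cudaMalloc((void**)&d_globTemp, 3*3*sizeof(float));
   cudaMemcpy(d_globTemp, globTemp, 3*3*sizeof(float), cudaMemcpyHostToDevice);
   compute<<<blocks,threads>>>(a_d, d_globTemp, 3);   // pass the buffer directly
   cudaMemcpy(a, a_d, 3*3*sizeof(float), cudaMemcpyDeviceToHost);

   for (int i = 0; i < 3*3; i++)
     printf("a[%d] = %f\n", i, a[i]);
   return 0;
}
```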

Robert Crovella