0

As part of my thesis work I am working in a CUDA project (modifying somebody elses code, adding functionality, etc). Being new to CUDA this is turning to be a real challenge for me. I am working with compute capability 1.3 cards, 4 x Tesla C1060. And sadly, I am hitting some limitations of the platform.

I need to pass a couple of new structures to device, which I believe are been copied correctly. But, when trying to pass the pointers to structure on device on my kernel call I reach the 256 bytes limit (as addressed in this question).

My code goes like this:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);
  SimState DeviceMem;

  TetrahedronStructGPU *h_root = &(hstate->root);
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = &(hstate->faces);
  TriangleFacesGPU *d_faces;

  GPUThreadStates tstates;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, &DeviceMem, &tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates, /*OK, these 2*/
                                           d_root, d_faces);
                           // Limit of 256 bytes adding d_root and/or d_faces
  cudaThreadSynchronize();

  ...

}

The InitGPUStates function is in another source file:

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  ...
}

I understand that I need to pass only pointers to the locations on device memory. How can I get the address in device? Is this passing of pointers correctly done?

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel is

__global__ void kernel(SimState d_state, GPUThreadStates tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);

I will work on changing SimState d_state to pointer pass SimState *d_state. As well as GPUThreadStates tstates to GPUThreadStates *tstates.

Community
  • 1
  • 1
mrei
  • 121
  • 14
  • what is the declaration of `kernel`? For example you appear to be passing `tstates` *by value* to `kernel`. If `sizeof(GPUThreadStates)` is large, you can free up some breathing room by passing that structure by pointer rather than by value. The problem is, `d_root` and `d_faces` are *already* pointers. So if you're out of parameter space just adding those two pointers, you're going to need to shrink the size of something else you are passing, like `DeviceMem` (`sizeof(SimState)`) and `tstates` (`sizeof(GPUThreadStates)`). This will also affect your kernel code referencing these entities. – Robert Crovella Feb 24 '14 at 20:20
  • @RobertCrovella You are right. I wasn't sure I was doing the pointer passing properly. The kernel definition `__global__ void MCMLKernel(SimState d_state, GPUThreadStates tstates, TetrahedronStructGPU *d_root, TriangleFacesGPU *d_faces)` and both `d_state` and `tstates` are being passed by value, aren't they? – mrei Feb 24 '14 at 20:53
  • Yes, they appear to be, although you haven't actually shown the definition of `GPUThreadStates` and `SimState`. If the sizes of those are large, preventing you from adding `d_root` (a pointer) and `d_faces` (a pointer), then you will have to focus on those. – Robert Crovella Feb 24 '14 at 21:00
  • @RobertCrovella Thanks again. I am working on those, `GPUThreadStates` and `SimState` are quite big too. I am adding those definitions above. – mrei Feb 24 '14 at 21:35
  • @RobertCrovella I posted the modifications I made as an answer for better formatting. I am having errors `code=11(cudaErrorInvalidValue) "cudaMalloc((void**)&DeviceMem->n_photons_left, size)" `. I would really appreciate your help! Thanks! – mrei Feb 25 '14 at 00:10

2 Answers2

1

It seems that you haven't initialized the DeviceMem structure, which is supposed to hold the pointer that should be later initialized with cudaMalloc.

You should do something like:

SimState* DeviceMem;

cudaMalloc(&DeviceMem, sizeof(SimState)) 

too (or any other way to allocate memory for that pointer).

Ashalynd
  • 12,363
  • 2
  • 34
  • 37
  • It does not matter that I am allocating each element in the `SimState` structure by itself? I didn't include that code, only the first three elements `DeviceMem->n_p_left`, `DeviceMem->a`, and `DeviceMem->x`. But, I do the same for the following elements as well (`Rd_ra`, `A_rz`, and `Tt_ra`). – mrei Feb 25 '14 at 18:17
  • Since I am passing `DeviceMem` by reference (not value as before) in the `kernel` call, I might need to allocate the complete structure as mentioned. But, not 100% sure. The thing is that the debugging capability on the GPU is so limited (maybe I'm not finding the correct way yet). I find it difficult to track my data once on the device. – mrei Feb 25 '14 at 19:06
  • Well, strictly speaking it's not "passing by reference". You are passing by pointer. If you were passing by reference, your object would have already existed and you'd been using a reference to it as a function argument, e.g. `void f(int& i)` - here `i`, the function argument, is passed as a reference. But it's only possible in C++, not in pure C. – Ashalynd Feb 25 '14 at 19:13
  • @kronos @t_carn In the [question](http://stackoverflow.com/questions/19404965/how-to-use-cudamalloc-cudamemcpy-for-a-pointer-to-a-structure-containing-point). How do you make your kernel call `doThings`? Do you pass the pointer to structure `Matrix` as a whole or to `Matrix.elements` individually? – mrei Feb 25 '14 at 19:21
  • As far as I can see, they allocate the array of Matrix structures by the regular malloc, and then pass the data members of these structures to the CUDA functions. – Ashalynd Feb 25 '14 at 19:24
  • You are right. I tried without the casting the pointer to void** and I am not getting the error anymore... Also, I am doing `cudaMalloc((void**)&DeviceMem, sizeof(SimState))`. I am thinking that in individual elements of the structure I don't need the casting since I'am referencing to a pointer in that structure. But, for `DeviceMem` I still need to use void**. – mrei Feb 25 '14 at 19:26
0

Finally, solved the 256 bytes issue. But, really still lost in pointers

My modified code goes like this:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);

  // new pointers to pass
  SimState *DeviceMem = (SimState*)malloc(sizeof(SimState));
  GPUThreadStates *tstates = (GPUThreadStates*)malloc(sizeof(GPUThreadStates));

  TetrahedronStructGPU *h_root = hstate->root; //root, pointer in HostThreadState
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = hstate->faces; //faces, pointer in HostThreadState
  TriangleFacesGPU *d_faces;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, DeviceMem, tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates,
                                           d_root, d_faces);
                                         // No limit reached!
  cudaThreadSynchronize();

  ...      
}

In the InitGPUStates function the changes are as follow. Special attention to the copy of DeviceMem (I tried many forms without success). Some forms (with parenthesis, like this cudaMalloc((void **)&(*DeviceMem).n_p_left, size)) will not give me any error. I am assuming that no errors means no data copied to device. In the current form the error is code=11(cudaErrorInvalidValue) "cudaMalloc((void**)&DeviceMem->n_photons_left, size)".

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  // HELP NEEDED MAINLY FROM HERE REGARDING POINTER VALUE COPY!
  checkCudaErrors( cudaMalloc((void**)&DeviceMem, sizeof(SimState) ); //Needed?

  size = sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->n_p_left, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->n_p_left,
                   HostMem->n_p_left, size, cudaMemcpyHostToDevice) );

  size = n_threads * sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->a, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->a, HostMem->a, size,
                                      cudaMemcpyHostToDevice) );
  size = n_threads * sizeof(UINT64);
  checkCudaErrors( cudaMalloc(&DeviceMem->x, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->x, HostMem->x, size,
                                      cudaMemcpyHostToDevice) );
  ...
}

I understand that I need to pass only pointers to the locations on device memory. How can I get the address in device? Is this passing of pointers correctly done?

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel is changed to:

__global__ void kernel(SimState *d_state, GPUThreadStates *tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);
mrei
  • 121
  • 14
  • Should it not be something like `cudaMalloc((void**)&(DeviceMem->n_p_left), size)` ? – Ashalynd Feb 25 '14 at 00:13
  • @Ashalynd I tried. Still getting the same error code=11(cudaErrorInvalidValue) – mrei Feb 25 '14 at 00:22
  • 1
    there is already a similar question: http://stackoverflow.com/questions/19404965/how-to-use-cudamalloc-cudamemcpy-for-a-pointer-to-a-structure-containing-point (And it looks like you don't need to cast your pointer to void**) – Ashalynd Feb 25 '14 at 00:25
  • Ah, there is something else. Your InitGPUStates function should receive a pointer to your DeviceMem and then give that pointer to the cudaMalloc. And you forgot to initialize DeviceMem structure :) – Ashalynd Feb 25 '14 at 00:32
  • @Ashalynd by initialize `DeviceMem` you mean `cudaMalloc` as in your answer? I know there is a question about void** around, gonna re-read the answers better to understand better the [need of void**](http://stackoverflow.com/questions/12936986/why-does-cudamalloc-use-pointer-to-pointer) in the first place. Thanks! I am checking the [question](http://stackoverflow.com/questions/19404965/how-to-use-cudamalloc-cudamemcpy-for-a-pointer-to-a-structure-containing-point) as well. – mrei Feb 25 '14 at 17:46
  • Basically I mean that you have declared DeviceMem as a pointer-to-struct, but the struct itself was not allocated (at least I didn't see it in provided code snippets). That is probably the reason why you were getting these errors, because the pointer was pointing to garbage. – Ashalynd Feb 25 '14 at 19:10