6

I took the code given as an answer for How can I add up two 2d (pitched) arrays using nested for loops? and tried to use it for 3D instead of 2D and changed other parts slightly too, now it looks as follows:

 __global__ void doSmth(int*** a) {
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++) 
     a[i][j][k]=i+j+k;
 }

 int main() {
  int*** h_c = (int***) malloc(2*sizeof(int**));
  for(int i=0; i<2; i++) {
   h_c[i] = (int**) malloc(2*sizeof(int*));
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMalloc((void**)&h_c[i][j],2*sizeof(int)));
  }
  int*** d_c;
  GPUerrchk(cudaMalloc((void****)&d_c,2*sizeof(int**)));
  GPUerrchk(cudaMemcpy(d_c,h_c,2*sizeof(int**),cudaMemcpyHostToDevice));
  doSmth<<<1,1>>>(d_c);
  GPUerrchk(cudaPeekAtLastError());

  int res[2][2][2];
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMemcpy(&res[i][j][0],
    h_c[i][j],2*sizeof(int),cudaMemcpyDeviceToHost));  

  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++) 
     printf("[%d][%d][%d]=%d\n",i,j,k,res[i][j][k]);     
 }

In the above code I use 2 as sizes for each of the dimension of h_c, in the real implementation I will have these sizes in very large numbers and in different ones for every part of the subarrays of "int***" or more dimensions. I am getting problem with the part after the kernel call where I try to copy back the results to res array. Can you help me fix the problem? Plz can you show solution in the way I am writing it above. Thanks!

Community
  • 1
  • 1
starter
  • 325
  • 1
  • 5
  • 15

1 Answers1

11

First of all, I think talonmies when he posted the response to the previous question you mention, was not intending that to be representative of good coding. So figuring out how to extend it to 3D might not be the best use of your time. For example, why do we want to write programs which use exactly one thread? While there might be legitimate uses of such a kernel, this is not one of them. Your kernel has the possibility to do a bunch of independent work in parallel, but instead you are forcing it all onto one thread, and serializing it. The definition of the parallel work is:

a[i][j][k]=i+j+k;

Let's figure out how to handle that in parallel on the GPU.

Another introductory observation I would make is that since we are dealing with problems that have sizes that are known ahead of time, let's use C to tackle them with as much benefit as we can get from the language. Nested loops to do cudaMalloc may be needed in some cases, but I don't think this is one of them.

Here's a code that accomplishes the work in parallel:

#include <stdio.h>
#include <stdlib.h>
// set a 3D volume
// To compile it with nvcc execute: nvcc -O2 -o set3d set3d.cu
//define the data set size (cubic volume)
#define DATAXSIZE 100
#define DATAYSIZE 100
#define DATAZSIZE 20
//define the chunk sizes that each threadblock will work on
#define BLKXSIZE 32
#define BLKYSIZE 4
#define BLKZSIZE 4

// for cuda error checking
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            return 1; \
        } \
    } while (0)

// device function to set the 3D volume
__global__ void set(int a[][DATAYSIZE][DATAXSIZE])
{
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned idy = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned idz = blockIdx.z*blockDim.z + threadIdx.z;
    if ((idx < (DATAXSIZE)) && (idy < (DATAYSIZE)) && (idz < (DATAZSIZE))){
      a[idz][idy][idx] = idz+idy+idx;
      }
}

int main(int argc, char *argv[])
{
    typedef int nRarray[DATAYSIZE][DATAXSIZE];
    const dim3 blockSize(BLKXSIZE, BLKYSIZE, BLKZSIZE);
    const dim3 gridSize(((DATAXSIZE+BLKXSIZE-1)/BLKXSIZE), ((DATAYSIZE+BLKYSIZE-1)/BLKYSIZE), ((DATAZSIZE+BLKZSIZE-1)/BLKZSIZE));
// overall data set sizes
    const int nx = DATAXSIZE;
    const int ny = DATAYSIZE;
    const int nz = DATAZSIZE;
// pointers for data set storage via malloc
    nRarray *c; // storage for result stored on host
    nRarray *d_c;  // storage for result computed on device
// allocate storage for data set
    if ((c = (nRarray *)malloc((nx*ny*nz)*sizeof(int))) == 0) {fprintf(stderr,"malloc1 Fail \n"); return 1;}
// allocate GPU device buffers
    cudaMalloc((void **) &d_c, (nx*ny*nz)*sizeof(int));
    cudaCheckErrors("Failed to allocate device buffer");
// compute result
    set<<<gridSize,blockSize>>>(d_c);
    cudaCheckErrors("Kernel launch failure");
// copy output data back to host

    cudaMemcpy(c, d_c, ((nx*ny*nz)*sizeof(int)), cudaMemcpyDeviceToHost);
    cudaCheckErrors("CUDA memcpy failure");
// and check for accuracy
    for (unsigned i=0; i<nz; i++)
      for (unsigned j=0; j<ny; j++)
        for (unsigned k=0; k<nx; k++)
          if (c[i][j][k] != (i+j+k)) {
            printf("Mismatch at x= %d, y= %d, z= %d  Host= %d, Device = %d\n", i, j, k, (i+j+k), c[i][j][k]);
            return 1;
            }
    printf("Results check!\n");
    free(c);
    cudaFree(d_c);
    cudaCheckErrors("cudaFree fail");
    return 0;
}

Since you've asked for it in the comments, here is the smallest number of changes I could make to your code to get it to work. Let's also remind ourselves of some of talonmies comments from the previous question you reference:

"For code complexity and performance reasons, you really don't want to do that, using arrays of pointers in CUDA code is both harder and slower than the alternative using linear memory."

"it is such a poor idea compared to using linear memory."

I had to diagram this out on paper to make sure I got all my pointer copying correct.

#include <cstdio>
inline void GPUassert(cudaError_t code, char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
        if (Abort) exit(code);
    }
}

#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }



 __global__ void doSmth(int*** a) {
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++)
     a[i][j][k]=i+j+k;
 }
 int main() {
  int*** h_c = (int***) malloc(2*sizeof(int**));
  for(int i=0; i<2; i++) {
   h_c[i] = (int**) malloc(2*sizeof(int*));
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMalloc((void**)&h_c[i][j],2*sizeof(int)));
  }
  int ***h_c1 = (int ***) malloc(2*sizeof(int **));
  for (int i=0; i<2; i++){
    GPUerrchk(cudaMalloc((void***)&(h_c1[i]), 2*sizeof(int*)));
    GPUerrchk(cudaMemcpy(h_c1[i], h_c[i], 2*sizeof(int*), cudaMemcpyHostToDevice));
    }
  int*** d_c;
  GPUerrchk(cudaMalloc((void****)&d_c,2*sizeof(int**)));
  GPUerrchk(cudaMemcpy(d_c,h_c1,2*sizeof(int**),cudaMemcpyHostToDevice));
  doSmth<<<1,1>>>(d_c);
  GPUerrchk(cudaPeekAtLastError());
  int res[2][2][2];
  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    GPUerrchk(cudaMemcpy(&res[i][j][0], h_c[i][j],2*sizeof(int),cudaMemcpyDeviceToHost));

  for(int i=0; i<2; i++)
   for(int j=0; j<2; j++)
    for(int k=0; k<2; k++)
     printf("[%d][%d][%d]=%d\n",i,j,k,res[i][j][k]);
 }

In a nutshell, we have to do a successive sequence of:

  1. malloc a multidimensional array of pointers (on the host), one dimension less than the problem size, with the last dimension being a set of pointers to regions cudaMalloc'ed onto the device rather than the host.
  2. create another multidimensional array of pointers, of the same class as that created in the previous step, but one dimension less than that created in the previous step. this array must also have it's final ranks cudaMalloc'ed on the device.
  3. copy the last set of host pointers from the second previous step into the area cudaMalloced on the device in the previous step.
  4. repeat steps 2-3 until we end up with a single (host) pointer pointing to the multidimensional array of pointers, all of which are now resident on the device.
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • thanks, plz can you show me how to solve it the way i am doing it currently. thanks a lot! – starter Oct 16 '12 at 23:35
  • Can you provide a complete, compilable example of what it is you are trying to do? It's a matter of convenience for those who are trying to help you. – Robert Crovella Oct 17 '12 at 00:34
  • Sorry if I did not ask it correctly. The above example is the complete one. If I will change there anything, that would be the size of "h_c" it may become "int****" where each of its subarrays will have very large and different sizes. Thanks! Sorry again if I asked it wrongly. Thanks a lot in advance for your help. – starter Oct 17 '12 at 01:04
  • The example you provided is not complete or compilable. One reason is that you do not define GPUerrchk anywhere, for example. Yes, I can go to the previous question and piece together your intent by combining bits of that example with yours, but that's inconvenient, especially since you presumably have the complete code and are compiling it. Anyway, I have updated my answer with a response to show you how to solve it the way you are doing it currently. – Robert Crovella Oct 17 '12 at 02:50
  • I know this may be too much, but is it possible to do everything with h_c only without the part with h_c1? Thanks!!! – starter Oct 17 '12 at 03:36
  • Nothing like that occurred to me. You may find a way, however. – Robert Crovella Oct 17 '12 at 05:43
  • @RobertCrovella, thank you! I thought of doing something similar as the thread starter here, and now I know it's simply the wrong way and how to do it in a proper manner... – itsid Feb 09 '13 at 03:38