In a CPU version of some code, I have many calls that look like the following:
for(int i = 0; i < N; i++){
    dgemm(A[i], B[i], C[i], Size[i][0], Size[i][1], Size[i][2], Size[i][3], 'N', 'T');
}
where each A[i] is a 2D matrix of some size.
I would like to be able to do this on a GPU using CULA (I'm not just doing multiplies, so I need the linear algebra operations in CULA), so for example:
for(int i = 0; i < N; i++){
    status = culaDeviceDgemm('T', 'N', Size[i][0], Size[i][0], Size[i][0], alpha, GlobalMat_d[i], Size[i][0], NG_d[i], Size[i][0], beta, GG_d[i], Size[i][0]);
}
However, I would like to store my B's on the GPU in advance at the start of the program, as they don't change, but I have no idea how to go about doing that, or how to store my arrays in general so that this is possible.
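To make the goal concrete, something along these lines is what I am aiming for (only a rough sketch; B_d, rowsB and colsB are placeholder names, and error checking is left out):

// Rough sketch of a one-time setup: copy every B[i] to the GPU and keep the
// device pointers in a host-side array for later use in CULA calls.
// rowsB[i]/colsB[i] are placeholders for whatever the dimensions of B[i] are.
double **B_d = (double **)malloc(N * sizeof(double*));
for(int i = 0; i < N; i++){
    size_t bytes = rowsB[i]*colsB[i]*sizeof(double);
    cudaMalloc( (void **)&B_d[i], bytes );
    cudaMemcpy( B_d[i], B[i], bytes, cudaMemcpyHostToDevice );
}

The idea being that each B_d[i] could then be passed straight to the CULA calls later on, but I don't know whether this is a sensible way to organise the storage.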
I've seen various things online about using 3D matrices with CUDA, but they don't seem very applicable to then calling the CULA functions on the individual matrices.
From the example in the answer below I have this:
extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){
cudaError_t err;
err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
checkCudaError(err);
for(int i =0; i < numpulsars;i++){
err = cudaMalloc( (void **) &(GlobalFVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
checkCudaError(err);
// err = cudaMemcpy( GlobalFVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
// checkCudaError(err);
}
}
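checkCudaError is just a small helper of mine that aborts on any CUDA error; roughly along these lines (a sketch, not the exact code, included only so the snippet is self-contained):

// Rough stand-in for the checkCudaError helper: print the CUDA error string and bail out.
// (Needs <stdio.h>, <stdlib.h> and the CUDA runtime header.)
void checkCudaError(cudaError_t err){
    if(err != cudaSuccess){
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(1);
    }
}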
In that function I have declared double **GlobalFVecs_d as a global. However, I get a seg fault when it hits the line

    err = cudaMalloc( (void **) &(GlobalFVecs_d[i]), numcoeff*numcoeff*sizeof(double) );

yet it seems to be exactly what is done in the other example?
I realised it wasn't quite the same: the array of pointers GlobalFVecs_d lives in device memory, so cudaMalloc can't write the new pointer into GlobalFVecs_d[i] from host code. I now have code that compiles, with:
double **GlobalFVecs_d;
double **GlobalFPVecs_d;

extern "C" void copyFNFVecs_(double **FNFVecs, int numpulsars, int numcoeff){

    cudaError_t err;
    GlobalFPVecs_d = (double **)malloc(numpulsars * sizeof(double*));
    err = cudaMalloc( (void ***)&GlobalFVecs_d, numpulsars*sizeof(double*) );
    checkCudaError(err);

    for(int i = 0; i < numpulsars; i++){
        err = cudaMalloc( (void **) &(GlobalFPVecs_d[i]), numcoeff*numcoeff*sizeof(double) );
        checkCudaError(err);
        err = cudaMemcpy( GlobalFPVecs_d[i], FNFVecs[i], sizeof(double)*numcoeff*numcoeff, cudaMemcpyHostToDevice );
        checkCudaError(err);
    }

    err = cudaMemcpy( GlobalFVecs_d, GlobalFPVecs_d, sizeof(double*)*numpulsars, cudaMemcpyHostToDevice );
    checkCudaError(err);
}
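For reference, this is roughly how the routine gets called once at the start of the program (just a sketch; the way FNFVecs is built here is a stand-in for my actual setup):

// Hypothetical host-side setup: numpulsars pointers, each to a
// numcoeff*numcoeff block, then one call to do all the copying.
double **FNFVecs = (double **)malloc(numpulsars * sizeof(double*));
for(int i = 0; i < numpulsars; i++){
    FNFVecs[i] = (double *)malloc(numcoeff*numcoeff*sizeof(double));
    /* ... fill FNFVecs[i] ... */
}
copyFNFVecs_(FNFVecs, numpulsars, numcoeff);
// After this, GlobalFPVecs_d[i] (on the host) and GlobalFVecs_d (on the device)
// both hold the device addresses of the individual matrices.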
However, if I now try to access it with:
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid; //((G + dimBlock.x - 1) / dimBlock.x, (N + dimBlock.y - 1) / dimBlock.y);
dimGrid.x = (numcoeff + dimBlock.x - 1)/dimBlock.x;
dimGrid.y = (numcoeff + dimBlock.y - 1)/dimBlock.y;

for(int i = 0; i < numpulsars; i++){
    CopyPPFNF<<<dimGrid, dimBlock>>>(PPFMVec_d, GlobalFVecs_d[i], numpulsars, numcoeff, i);
}
it seg faults here instead. Is this not how to get at the data?