7

I am trying to use cublas function cublasSgemmBatched in my toy example. In this example I first allocate 2D arrays: h_AA, h_BB of the size [6][5] and h_CC of the size [6][1]. After that I copied it to the device, performed cublasSgemmBatched and tried to copy array d_CC back to the host array h_CC. However, I got a error (cudaErrorLaunchFailure) with device to host copying and I am not sure that I copied arrays into the device correctly:

int main(){
    cublasHandle_t handle;
    cudaError_t cudaerr;
    cudaEvent_t start, stop;
    cublasStatus_t stat;
    const float alpha = 1.0f;
    const float beta = 0.0f;
    float **h_AA, **h_BB, **h_CC;
    h_AA = new float*[6];
    h_BB = new float*[6];
    h_CC = new float*[6];
    for (int i = 0; i < 6; i++){
        h_AA[i] = new float[5];
        h_BB[i] = new float[5];
        h_CC[i] = new float[1];
        for (int j = 0; j < 5; j++){
            h_AA[i][j] = j;
            h_BB[i][j] = j;
        }
        h_CC[i][0] = 1;
    }
    float **d_AA, **d_BB, **d_CC;
    cudaMalloc(&d_AA, 6 * sizeof(float*));
    cudaMalloc(&d_BB, 6 * sizeof(float*));
    cudaMalloc(&d_CC, 6 * sizeof(float*));
    cudaerr = cudaMemcpy(d_AA, h_AA, 6 * sizeof(float*), cudaMemcpyHostToDevice);
    cudaerr = cudaMemcpy(d_BB, h_BB, 6 * sizeof(float*), cudaMemcpyHostToDevice);
    cudaerr = cudaMemcpy(d_CC, h_CC, 6 * sizeof(float*), cudaMemcpyHostToDevice);
    stat = cublasCreate(&handle);
    stat = cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, 1, 5, &alpha,
             (const float**)d_AA, 1, (const float**)d_BB, 5, &beta, d_CC, 1, 6);
    cudaerr = cudaMemcpy(h_CC, d_CC, 6 * sizeof(float*), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
}

So this code works, however the last cudaerr returns cudaErrorLaunchFailure. I was trying to follow this sample code on Github.

Thanks

P.S. What I don't understand, what is the sizeof(float*) and how cudaMalloc knows how many memory required for each array (like here I determine the size of 1 dimension only).

UPDATE: I did it!!:

cublasHandle_t handle;
cudaError_t cudaerr;
cudaEvent_t start, stop;
cublasStatus_t stat;
const float alpha = 1.0f;
const float beta = 0.0f;

float *h_A = new float[5];
float *h_B = new float[5];
float *h_C = new float[6];
for (int i = 0; i < 5; i++)
{
    h_A[i] = i;
    h_B[i] = i;
}



float **h_AA, **h_BB, **h_CC;
h_AA = (float**)malloc(6* sizeof(float*));
h_BB = (float**)malloc(6 * sizeof(float*));
h_CC = (float**)malloc(6 * sizeof(float*));
for (int i = 0; i < 6; i++){
    cudaMalloc((void **)&h_AA[i], 5 * sizeof(float));
    cudaMalloc((void **)&h_BB[i], 5 * sizeof(float));
    cudaMalloc((void **)&h_CC[i], sizeof(float));
    cudaMemcpy(h_AA[i], h_A, 5 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_BB[i], h_B, 5 * sizeof(float), cudaMemcpyHostToDevice);
}
float **d_AA, **d_BB, **d_CC;
cudaMalloc(&d_AA, 6 * sizeof(float*));
cudaMalloc(&d_BB, 6 * sizeof(float*));
cudaMalloc(&d_CC, 6 * sizeof(float*));
cudaerr = cudaMemcpy(d_AA, h_AA, 6 * sizeof(float*), cudaMemcpyHostToDevice);
cudaerr = cudaMemcpy(d_BB, h_BB, 6 * sizeof(float*), cudaMemcpyHostToDevice);
cudaerr = cudaMemcpy(d_CC, h_CC, 6 * sizeof(float*), cudaMemcpyHostToDevice);
stat = cublasCreate(&handle);
    stat = cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, 1, 5, &alpha, 
             (const float**)d_AA, 1, (const float**)d_BB, 5, &beta, d_CC, 1, 6);
    cudaerr = cudaMemcpy(h_CC, d_CC, sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 6;i++)
        cudaMemcpy(h_C+i, h_CC[i], sizeof(float), cudaMemcpyDeviceToHost);
cublasDestroy(handle);
Mikhail Genkin
  • 3,247
  • 4
  • 27
  • 47
  • The messed up data you are passing is causing one of the kernels launched by the batch gemm call to fail. As an asynchronous error, you may not receive notification of it until the next cuda call. Have you studied the batched cublas [cuda sample code](http://docs.nvidia.com/cuda/cuda-samples/index.html#batchcublas) ? – Robert Crovella Jan 13 '15 at 21:27
  • I didn't, doing it right now – Mikhail Genkin Jan 13 '15 at 21:31
  • I did it! Thanks. So, do I understand correct: In order to have a business with 2D device arrays, you should create host array of pointers to device arrays, and after that copy this array to 2D device array memory. In order to retrieve 2D host array from 2D device array, you should again use intermediate 2D array, which is array of host pointers to device arrays. I posted working code into the Update – Mikhail Genkin Jan 13 '15 at 22:18
  • Yes, it is an example of a need for a deep-copy mechanism, which would be similar to what you would have to do if you wanted to copy a matrix to the device and be able to access it directly using double-subscript notation. What you are calling 2D device arrays are still linear/flattened arrays. The "2D" or deep-copy aspect of it comes in because you have an array of these arrays that you want to pass to the device, and this is analogous to the deep-copy mechanism required to pass a doubly-subscripted array. Why don't you post your update as an answer. It's OK to answer your own question – Robert Crovella Jan 13 '15 at 22:22

1 Answers1

9

So, I figured out the answer (thanks to @Robert Crovella): in order to create device array of pointers to device arrays (for batched functions), one should first create host array of pointers to device arrays, and after that copy it into device array of pointers to device arrays. The same is true about transfering back to host: one should use intermediate host array of pointers to device arrays.

cublasHandle_t handle;
cudaError_t cudaerr;
cudaEvent_t start, stop;
cublasStatus_t stat;
const float alpha = 1.0f;
const float beta = 0.0f;

float *h_A = new float[5];
float *h_B = new float[5];
float *h_C = new float[6];
for (int i = 0; i < 5; i++)
{
    h_A[i] = i;
    h_B[i] = i;
}



float **h_AA, **h_BB, **h_CC;
h_AA = (float**)malloc(6* sizeof(float*));
h_BB = (float**)malloc(6 * sizeof(float*));
h_CC = (float**)malloc(6 * sizeof(float*));
for (int i = 0; i < 6; i++){
    cudaMalloc((void **)&h_AA[i], 5 * sizeof(float));
    cudaMalloc((void **)&h_BB[i], 5 * sizeof(float));
    cudaMalloc((void **)&h_CC[i], sizeof(float));
    cudaMemcpy(h_AA[i], h_A, 5 * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(h_BB[i], h_B, 5 * sizeof(float), cudaMemcpyHostToDevice);
}
float **d_AA, **d_BB, **d_CC;
cudaMalloc(&d_AA, 6 * sizeof(float*));
cudaMalloc(&d_BB, 6 * sizeof(float*));
cudaMalloc(&d_CC, 6 * sizeof(float*));
cudaerr = cudaMemcpy(d_AA, h_AA, 6 * sizeof(float*), cudaMemcpyHostToDevice);
cudaerr = cudaMemcpy(d_BB, h_BB, 6 * sizeof(float*), cudaMemcpyHostToDevice);
cudaerr = cudaMemcpy(d_CC, h_CC, 6 * sizeof(float*), cudaMemcpyHostToDevice);
stat = cublasCreate(&handle);
    stat = cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, 1, 1, 5, &alpha, 
             (const float**)d_AA, 1, (const float**)d_BB, 5, &beta, d_CC, 1, 6);
    cudaerr = cudaMemcpy(h_CC, d_CC, sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 6;i++)
        cudaMemcpy(h_C+i, h_CC[i], sizeof(float), cudaMemcpyDeviceToHost);
cublasDestroy(handle);
Mikhail Genkin
  • 3,247
  • 4
  • 27
  • 47
  • I think it should be ```cudaMemcpy(h_CC, d_CC, 6*sizeof(float*), cudaMemcpyDeviceToHost);``` just compare it with ```cudaMemcpy(d_CC, h_CC, 6*sizeof(float*), cudaMemcpyHostToDevice);``` it a typo? – PhysicsMath Feb 22 '22 at 10:22