Cuda matrix copy program is very slow

Question

#include<stdio.h>
#include<assert.h>
void verify(float * A, float * B, int size);

__global__ void CopyData(float *d_array, float* d_dest_array, size_t pitch, int cols, int rows)
{
  for(int i=0; i<rows; i++){
        float *rowData = (float*)(((char*)d_array) + (i*pitch));
        for(int j=0; j<cols; j++){
            d_dest_array[i*cols+j] = *(rowData+j);
        }
    }
}

int main(int argc, char **argv)
{
    int row, col, i, j; 
    float time1, time2;
    float *d_array;                 // dev arr which mem will be alloc to
    float *d_dest_array;        // dev arr that will be a copy
    size_t pitch;                       // ensures correct data struc alignm    
    if(argc != 3)
  {
        printf("Usage: %s [row] [col]\n", argv[0]);
        return 1;
  }

    row = atoi(argv[1]);
    col = atoi(argv[2]);
    float *h1_array = new float[col*row];
    float *h2_array = new float[col*row];
    float *h_ori_array = new float[col*row];
    for (i = 0; i<row; i++){
        for(j = 0; j<col; j++){
            h_ori_array[i*col+j] = i*col + j;
        }
    }
    cudaEvent_t start, stop;

    cudaMallocPitch(&d_array, &pitch, col*sizeof(float), row);
    cudaMalloc(&d_dest_array, col*row*sizeof(float));
    cudaMemcpy2D(d_array, pitch, h_ori_array, col*sizeof(float), col*sizeof(float), row, cudaMemcpyHostToDevice);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    //CopyData<<<100, 512>>>(d_array, d_dest_array, pitch, col, row);
    for (i = 0; i<row; i++){
        for(j = 0; j<col; j++){
            h1_array[i*col+j] = h_ori_array[i*col+j];
        }
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time1, start, stop);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    CopyData<<<row*col/512, 512>>>(d_array, d_dest_array, pitch, col, row);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time2, start, stop);

    cudaMemcpy2D(h2_array, pitch, d_dest_array, col*sizeof(float), col*sizeof(float), row, cudaMemcpyDeviceToHost);

    verify(h1_array, h2_array, row*col);

    free(h1_array); free(h2_array); free(h_ori_array);
  cudaFree(d_array); cudaFree(d_dest_array);
    printf("Exec time in ser = %f, par = %f ms with pitch %d", time1, time2, (int)pitch); 

    return 0;
}

void verify(float * A, float * B, int size)
{
    for (int i = 0; i < size; i++)
    {
        assert(A[i]==B[i]);
    }
     printf("Correct!");
}

It just makes a copy of a matrix. Both a serial and parallel version are written so that I can compare them.

It gives wrong answer if the array size is 64. For 256 and beyond, it gives correct answer. However it takes too long, 4 seconds for a 512x512 matrix.

I am not comfortable with cudaMemcpy2D. Can someone please pinpoint what I am doing wrong? Any suggestion regarding CUDA coding practices will also be appreciated. Also, while calling a kernel, how do I decide the block and grid dimension?

EDIT 1: The CopyData function that I have used does not use parallelism. I foolishly copied it from VIHARRI's answer at the bottom of the page.

The selected answer over there does not specify how the data was copied from host to device. Can someone show how it can be done using the cudaMallocPitch and cudaMemcpy2D functions? I am looking for the correct way to index inside the kernel as well as the correct way to copy a 2D array from host to device.

score 1 · Accepted Answer · answered May 24 '13 at 09:42

1

You're only running a single CUDA thread. (Actually on closer inspection you are running the same code in multiple threads, but the result is the same: you're not really exploiting the GPU hardware).

Ideally you need to run hundreds or thousands of concurrent threads to get best performance. One way to do this would be to have one thread per output element and then in each thread use the grid, block and thread IDs to determine which output element to process. Look at the examples in the CUDA SDK to understand the general pattern for parallel processing with CUDA.

answered May 24 '13 at 09:42

Paul R

208,748
37
389
560

So I need to make some drastic changes in CopyData. Am I right? – pyronic May 24 '13 at 09:49
Not drastic, but you need to get rid of the loops and just have each thread process one output element. – Paul R May 24 '13 at 09:50
@Paul_R My code was based on what I saw [here](http://stackoverflow.com/questions/5029920/how-to-use-2d-arrays-in-cuda). Is the example given by VIHARRI incorrect then? – pyronic May 24 '13 at 09:56
Look carefully at the code you linked to - it's not using loops in the kernel code - it's using the block and thread indices to determine which element to process. – Paul R May 24 '13 at 10:00
@Paul_R I am talking about the code at the bottom of the page. It has a for loop in the kernel code 'CopyData'. – pyronic May 24 '13 at 10:07
OK - I was looking at the question and the accepted answer from @ardiyu07 - the answer from VIHARRI is not a good example of a CUDA kernel (it's just focussing on allocating and passing the 2D array, not on good CUDA programming). – Paul R May 24 '13 at 10:29
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/30553/discussion-between-pyronic-and-paul-r) – pyronic May 24 '13 at 10:32

Cuda matrix copy program is very slow

1 Answers1