
I started learning CUDA a while ago, and I have run into the following problem.

Here is what I am doing:

Copy to the GPU:

int *B;
// ...
int *dev_B;
// initialize B = 0

cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
//...

// Execute on the GPU the following kernel, which is supposed to fill in
// the dev_B matrix with integers


findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);

Copy back to the CPU:

cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
  1. Copying array B to dev_B takes only a fraction of a second. However, copying array dev_B back to B takes forever.
  2. The findNeiborElem function contains a loop in each thread; it looks like this:

    __global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC){

        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < dev_Nel[0]){
            for (int j = 1; j <= dev_Nel[0]; j++){
                 // do some calculations
                 dev_B[ind(tid, 1, dev_Nel[0])] = j; // j in most cases does not go all the way up to Nel
                 break;
            }
            tid += blockDim.x * gridDim.x;
        }
    }
    

What is very weird about this is that the time to copy dev_B back to B is proportional to the number of iterations of the j index.

For example, if Nel=5 then the time is approximately 5 sec.

When I increase Nel to 20, the time is about 20 sec.

I would expect the copy time to be independent of the number of inner iterations needed to assign the values of the matrix dev_B.

Also, I would expect the time to copy the same matrix to and from the GPU to be of the same order.

Do you have any idea what is wrong?

giorgk

2 Answers


Instead of using clock() to measure time, you should use events:

With events you would have something like this:

  cudaEvent_t start, stop;   // the two timing events
  float time;                // will hold the elapsed time in milliseconds
  cudaEventCreate(&start);   // create event 1
  cudaEventCreate(&stop);    // create event 2
  cudaEventRecord(start, 0); // start measuring the time

  // What you want to measure
  cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
  cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);

  cudaEventRecord(stop, 0);                 // stop measuring the time
  cudaEventSynchronize(stop);               // wait until all device work preceding the most
                                            // recent call to cudaEventRecord() has completed

  cudaEventElapsedTime(&time, start, stop); // elapsed time (in ms) between the two events

EDIT: Additional information:

"The kernel launch returns control to the CPU thread before it is finished. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is "negligible"". credits to Robert Crovella

dreamcrash
  • Yes you should use this method. The kernel launch returns control to the CPU thread *before it is finished*. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is "negligible". – Robert Crovella Nov 12 '12 at 15:09

As for your second question:

    dev_B[ind(tid, 1, dev_Nel[0])] = j; // j in most cases does not go all the way up to Nel

When performing a calculation on the GPU, for synchronization reasons, every thread that has finished its job does not do any further work until all the threads in the same group (a warp, in CUDA terms) have finished.

In other words, the time needed to perform this calculation will be that of the worst case; it does not matter that most of the threads do not go all the way down.
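As a toy illustration (a hypothetical kernel, not part of the question's code), each lane of a warp in the sketch below runs a different number of loop iterations, but the warp as a whole only moves past the loop once its slowest lane has finished:

    // Hypothetical example: lanes that leave the loop early are masked off and
    // wait; the warp proceeds past the loop only when its slowest lane (lane 31,
    // with 32 iterations) is done, so the warp's runtime is the worst-case runtime.
    __global__ void divergentLoop(int *out, int n){
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        if (tid >= n) return;
        int iters = 0;
        for (int j = 0; j <= tid % 32; j++){   // lane k performs k+1 iterations
            iters++;
        }
        out[tid] = iters;                      // all active lanes reach this point together
    }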

I am not sure about your first question: how do you measure the time? I am not too familiar with CUDA, but I think that when copying from the CPU to the GPU the implementation buffers your data, hiding the effective time involved.
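If you want to keep host-side timers such as std::clock (the method described in the comments below), one possible sketch, reusing the question's variables, is to put an explicit synchronization point after the kernel launch so that the timer around the device-to-host copy does not also wait for the kernel to finish:

    #include <ctime>
    #include <iostream>

    // Sketch under the question's assumptions (Nblocks, Nthreads, dev_B, B, Nel, Nface, ...).
    findNeiborElem<<<Nblocks, Nthreads>>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
    cudaDeviceSynchronize();   // block the host until the kernel has actually finished

    std::clock_t start = std::clock();
    cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
    double duration = (std::clock() - start) / (double) CLOCKS_PER_SEC;
    std::cout << "Data copied from GPU in: " << duration << " sec\n";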

sbabbi
  • Thank you for the answer, but let me clarify: I'm purely referring to the time spent on the copy procedures. The time spent on the function "findNeiborElem" is practically negligible. To measure the time I use start = std::clock(); cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int),cudaMemcpyDeviceToHost); duration = ( std::clock() - start ) / (double) CLOCKS_PER_SEC; std::cout<<"Data Copied from GPU in: " << duration << " sec\n"; – giorgk Nov 12 '12 at 10:11
  • Well, this depends on which time you actually want to measure; keep in mind that most of the operations are performed asynchronously. In other words, when the procedure returns, little or no work has been done by the GPU yet, unless the procedure issues a sync point (which is the case for cudaMemcpyDeviceToHost). Here http://stackoverflow.com/questions/3553843/how-to-measure-the-execution-time-of-every-block-when-using-cuda it is explained better than I can. – sbabbi Nov 12 '12 at 10:33
  • That, and the performance of the data path GPU->CPU has been irrelevant until CUDA, so it's probably not as thoroughly optimized (AGP even made the conscious decision that transfer speeds should be asymmetric). – Simon Richter Nov 12 '12 at 11:39