
I'm writing a function that does a lot of BLAS gemv operations.

I would like to be able to do this on the GPU, and I've tried with cuBlas.

My problem is that my matrices and vectors are rather small (a 100x100 matrix and a 100-element vector), and cuBLAS takes ages compared to the CPU. I can see why: a mixture of fast cache on the CPU and a large overhead in making the calls to the GPU.

Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.

That is, the time it takes CUDA to set up the call and send it to the graphics processor -- not counting the time it actually takes to do the matrix-vector multiplication.

How would I go about doing this?

Peter Mortensen
Martin Kristiansen
  • So that you can dynamically choose whether to send a job to CUDA, or just for interest's sake? – Rup Jan 24 '12 at 10:50
  • @Rup: Just interested in figuring out what the call actually costs, and determining whether the slow code is my fault or just a product of the architecture ;-) – Martin Kristiansen Jan 24 '12 at 10:53
  • For small amounts of data it's not just the overhead hurting you, but also the lack of available parallelism. The GPU depends heavily on having enough threads to hide latencies (which are a lot worse on a GPU than on a CPU). Even without call overheads, the GPU is likely to be slower than the CPU unless the work is partitioned into **lots** of threads. Lots can easily mean thousands of threads. – Grizzly Jan 25 '12 at 23:40
  • @Grizzly I'm aware that the GPU needs lots of threads in order to hide memory access latency. But that raises the question: how many is a lot? Would 100 be a lot, or would we have to be in the range of 1000 or millions? – Martin Kristiansen Jan 26 '12 at 12:10
  • @MartinKristiansen: It depends on what kind of latencies need to be hidden. Optimal occupancy is typically somewhere in the range of tens of thousands. I would say that anything with fewer than 1000 threads is probably not going to profit from GPU computation, and fewer than 10000 threads will still waste most of the GPU's potential. Of course these are rules of thumb; in reality it depends on the kernel (particularly the amount of global memory accesses, which are hard to hide) and the GPU used. – Grizzly Jan 26 '12 at 14:44

4 Answers


Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.

The overhead is not the call per se but compilation of the GPU program and transferring the data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache, and the latency of DDR3 memory is far lower than that of the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.

N       FFTw (ms)   GPUFFT (ms)     GPUFFT MFLOPS   GPUFFT Speedup
8         0           0.06             3.352705     0.006881
16        0.001       0.065            7.882117     0.010217
32        0.001       0.075           17.10887      0.014695
64        0.002       0.085           36.080118     0.026744
128       0.004       0.093           76.724324     0.040122
256       0.007       0.107          153.739856     0.066754
512       0.015       0.115          320.200892     0.134614
1024      0.034       0.125          657.735381     0.270512
2048      0.076       0.156         1155.151507     0.484331
4096      0.173       0.215         1834.212989     0.804558
8192      0.483       0.32          2664.042421     1.510011
16384     1.363       0.605         3035.4551       2.255411
32768     3.168       1.14          3450.455808     2.780041
65536     8.694       2.464         3404.628083     3.528726
131072   15.363       5.027         3545.850483     3.05604
262144   33.223      12.513         3016.885246     2.655183
524288   72.918      25.879         3079.443664     2.817667
1048576 173.043      76.537         2192.056517     2.260904
2097152 331.553     157.427         2238.01491      2.106081
4194304 801.544     430.518         1715.573229     1.861814

The table above shows timings of a GPU FFT implementation versus a CPU implementation as a function of kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, some implementations/sizes entirely in cache. This makes the CPU the best choice for small operations.

If, on the other hand, you need to perform large batches of work on data with minimal moves to/from the GPU, then the GPU will beat the CPU hands down.

As far as measuring the effect in your example goes, I would suggest performing an experiment like the one above. Try to work out the FLOPS computed for each matrix size and run the test on the CPU and the GPU for varying sizes. Output the size, time and FLOPS for GPU vs. CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code and time the whole thing, then divide the total time by the number of iterations to get the loop time. Try differently shaped matrices as well, if your algorithm allows it (e.g. 10x100 rather than 100x10).
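To make that concrete, here is a rough sketch of such a sweep. It is not from the original answer: the matrix sizes, iteration count, naive CPU loop and the bench.csv file name are arbitrary choices of mine, and it assumes the cuBLAS v2 API.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// Wall-clock time in milliseconds.
static double nowMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void)
{
    const int sizes[] = { 10, 50, 100, 500, 1000 };  // arbitrary sweep of matrix sizes
    const int iters = 1000;                          // time many calls, then average
    cublasHandle_t handle;
    cublasCreate(&handle);
    FILE *csv = fopen("bench.csv", "w");
    fprintf(csv, "n,cpu_ms,gpu_ms,cpu_mflops,gpu_mflops\n");

    for (int s = 0; s < (int)(sizeof(sizes) / sizeof(sizes[0])); ++s)
    {
        const int n = sizes[s];
        float *A = (float *)malloc(sizeof(float) * n * n);
        float *x = (float *)malloc(sizeof(float) * n);
        float *y = (float *)malloc(sizeof(float) * n);
        for (int i = 0; i < n * n; ++i) A[i] = 1.0f;
        for (int i = 0; i < n; ++i)     x[i] = 1.0f;

        // CPU: naive column-major y = A*x, no tuning.
        double t0 = nowMs();
        for (int it = 0; it < iters; ++it)
            for (int r = 0; r < n; ++r)
            {
                float acc = 0.0f;
                for (int c = 0; c < n; ++c) acc += A[c * n + r] * x[c];
                y[r] = acc;
            }
        const double cpuMs = (nowMs() - t0) / iters;

        // GPU: same operation with the data already resident on the device.
        float *dA, *dx, *dy;
        cudaMalloc(&dA, sizeof(float) * n * n);
        cudaMalloc(&dx, sizeof(float) * n);
        cudaMalloc(&dy, sizeof(float) * n);
        cudaMemcpy(dA, A, sizeof(float) * n * n, cudaMemcpyHostToDevice);
        cudaMemcpy(dx, x, sizeof(float) * n, cudaMemcpyHostToDevice);
        const float one = 1.0f, zero = 0.0f;
        t0 = nowMs();
        for (int it = 0; it < iters; ++it)
            cublasSgemv(handle, CUBLAS_OP_N, n, n, &one, dA, n, dx, 1, &zero, dy, 1);
        cudaDeviceSynchronize();   // wait for all queued gemv calls to finish
        const double gpuMs = (nowMs() - t0) / iters;

        const double flop = 2.0 * n * n;   // one multiply and one add per matrix element
        fprintf(csv, "%d,%f,%f,%f,%f\n", n, cpuMs, gpuMs,
                flop / (cpuMs * 1.0e3), flop / (gpuMs * 1.0e3));

        cudaFree(dA); cudaFree(dx); cudaFree(dy);
        free(A); free(x); free(y);
    }

    fclose(csv);
    cublasDestroy(handle);
    return 0;
}

Since the data stays resident on the device, the GPU column mostly reflects kernel time plus per-call launch overhead, which is exactly the gap you are asking about.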

Using this data you can get a feel for what the overheads are. To find out exactly, repeat the same experiment, but replace the inner shader code executed on the GPU with a no-operation (simply copy from input to output).
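For the no-operation variant, something as simple as this pass-through kernel (the name is mine) would do: it moves the data but performs no arithmetic, so timing it isolates the launch and transfer overhead.

// Pass-through kernel: no arithmetic, just copy input to output,
// so any measured time is launch/memory overhead rather than computation.
__global__ void passthrough(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}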

Hope this helps,

Benedikt S. Vogler
Dr. Andrew Burnett-Thompson

To find the call overhead, call a CUDA kernel that does as little as possible.

struct timeval cpuStart, cpuEnd;   // requires #include <sys/time.h>

for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0); // get start time on CPU

    // Call minimal CUDA kernel here

    gettimeofday(&cpuEnd, 0); // get end time on CPU

    // accumulate the elapsed time (cpuEnd - cpuStart)
}

Follow the code in Alex P.'s answer below.

The less processing you do in the kernel, the more the time difference will be only the call overhead.

Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Be sure that the elapsed time is longer than the interval of your timer, or you'll end up with all zeros. If that happens, write some kernel code that executes in a fixed time interval that you can predict: (n loops of x cycles each).

It's hard to remove all the non-CUDA computations that might occur between cpuStart and cpuEnd (like interrupt processing), but making several runs and averaging can give good results.
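Putting those pieces together, a self-contained version of the measurement might look like the following; the empty noop kernel, the <<<1, 1>>> launch configuration and the NLoops value are illustrative assumptions, not something prescribed by this answer.

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

// A kernel that does nothing: any time measured around its launch is call overhead.
__global__ void noop(void) {}

int main(void)
{
    const int NLoops = 1000000;   // enough iterations that the total is well above the timer resolution
    struct timeval cpuStart, cpuEnd;

    gettimeofday(&cpuStart, 0);   // get start time on CPU
    for (int i = 0; i < NLoops; i++)
        noop<<<1, 1>>>();         // launch only; no synchronisation inside the loop
    gettimeofday(&cpuEnd, 0);     // get end time on CPU

    double elapsedUs = (cpuEnd.tv_sec - cpuStart.tv_sec) * 1.0e6
                     + (cpuEnd.tv_usec - cpuStart.tv_usec);
    printf("average launch overhead: %f us\n", elapsedUs / NLoops);

    cudaDeviceSynchronize();      // let the queued launches drain before exiting
    return 0;
}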

Anon

You can get the time in nanoseconds from the device when an event was queued, submitted, started, and finished by using clGetEventProfilingInfo on your buffer transfer event.

More info, and how to set it up, is here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
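A sketch of what that can look like in host code, assuming a cl_command_queue called queue that was created with CL_QUEUE_PROFILING_ENABLE, plus a device buffer deviceBuf, host pointer hostPtr and size bytes (the function name and all of those identifiers are placeholder names of mine):

#include <CL/cl.h>
#include <stdio.h>

/* Assumes queue was created with CL_QUEUE_PROFILING_ENABLE. */
void profileTransfer(cl_command_queue queue, cl_mem deviceBuf,
                     const void *hostPtr, size_t bytes)
{
    cl_event xfer;   /* event attached to the buffer transfer below */
    clEnqueueWriteBuffer(queue, deviceBuf, CL_FALSE, 0, bytes, hostPtr, 0, NULL, &xfer);
    clWaitForEvents(1, &xfer);   /* make sure the transfer has completed */

    cl_ulong queued, submitted, started, ended;   /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(xfer, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(xfer, CL_PROFILING_COMMAND_SUBMIT, sizeof(submitted), &submitted, NULL);
    clGetEventProfilingInfo(xfer, CL_PROFILING_COMMAND_START, sizeof(started), &started, NULL);
    clGetEventProfilingInfo(xfer, CL_PROFILING_COMMAND_END, sizeof(ended), &ended, NULL);

    /* queued -> start is queueing/launch overhead; start -> end is the transfer itself */
    printf("overhead: %llu ns, transfer: %llu ns\n",
           (unsigned long long)(started - queued),
           (unsigned long long)(ended - started));

    clReleaseEvent(xfer);
}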

I think that for 100x100 matrices you may be better off sticking to the CPU for the crunching. Unless you have many to multiply at the same time, the benefit of the GPU will be hardly noticeable due to the (small) transfer overhead and usually much lower clock speeds. Make sure you tweak your kernel to use as much local data as possible - on my hardware there is 32 KB per work group, and that should be plenty to hold two 100x100 matrices. The built-in dot product functions should also be very handy.

There was an awesome talk about this at AFDS last year (see session ID 2908): http://developer.amd.com/afds/pages/OLD/sessions.aspx They talk in detail about optimizing the kernel and hard-coding the optimal sizes.

mfa

Are your matrices already on the GPU? If not, CUBLAS might transfer them for you (known as thunking), which is an additional overhead.

Also, GPUs do not really shine for such small computations, i.e. it will probably be slower than the CPU, since you also have to transfer your result back. If you can, use bigger matrices. Otherwise you might want to use streams (cudaStream_t) to start multiple computations on the GPU in parallel.
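For example, a sketch along these lines could overlap several independent gemv calls; the function name, the stream count and the assumption that all device buffers are prepared elsewhere are mine, and it uses the cuBLAS v2 API:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdlib.h>

// Launch one y_i = A_i * x_i per stream so that the (small) kernels can overlap.
// dA[i], dx[i], dy[i] are device pointers prepared elsewhere; n is the matrix dimension.
void gemvInStreams(cublasHandle_t handle, int nStreams, int n,
                   float **dA, float **dx, float **dy)
{
    const float one = 1.0f, zero = 0.0f;
    cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));

    for (int i = 0; i < nStreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nStreams; ++i)
    {
        cublasSetStream(handle, streams[i]);   // subsequent cuBLAS calls go to this stream
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &one, dA[i], n, dx[i], 1, &zero, dy[i], 1);
    }

    cudaDeviceSynchronize();                   // wait for all streams to finish
    for (int i = 0; i < nStreams; ++i)
        cudaStreamDestroy(streams[i]);
    free(streams);
}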

If you want to measure the execution time of a kernel in CUDA, you need to enclose that (or anything else that computes on the GPU) in events, like this when using the CUDA runtime API:

cudaEvent_t start, stop;
cudaEventCreate(&start);   // events must be created before use
cudaEventCreate(&stop);

cudaEventRecord(start, 0); // note: events are passed by value, not by address

struct timeval cpuStart, cpuEnd;
gettimeofday(&cpuStart, 0); // get start time on CPU

// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...

gettimeofday(&cpuEnd, 0); // get end time on CPU

double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds

cudaEventRecord(stop, 0);

// Wait until the stop event has occurred
cudaError_t eventResult;
do
{
  eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);

// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>

// Retrieve the elapsed GPU time
float gpuDuration = 0.0f; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);

// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);

You might want to check the error code of every call to CUDA (at least with an assert), as you may get errors from previous calls, resulting in hours of debugging...

(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)

EDIT: Just saw that you want to measure the call itself, not the duration of the kernel. You can do that by simply measuring the time on the CPU for the call - see the updated code above. This works only on Linux because gettimeofday is not available for Windows (AFAIK).

Alex P.
  • On Windows you can use [QueryPerformanceCounter](http://msdn.microsoft.com/en-us/library/ms644904) or [GetSystemTime](http://msdn.microsoft.com/en-us/library/windows/desktop/ms725473.aspx) etc. – Rup Jan 24 '12 at 15:12
  • I've got all the data on the device, and only need to do a simple Ax->y and then keep y on the device. – Martin Kristiansen Jan 24 '12 at 15:26
  • In that case you can measure the time CUBLAS needs to launch the actual kernel by putting gettimeofday() (or a similar method on Windows) around your cublasDgemm() call. While I have not tried it myself, you could look into using Parallel Nsight (on Windows), or the Visual Compute Profiler (included in the toolkit on Linux). I can't find it right now, but I'm sure I've seen something about profiling hooks in CUDA 4... EDIT: Found this PDF that has some interesting information about profiling CUDA: http://bit.ly/zn6jbP – Alex P. Jan 24 '12 at 16:37