
I'm running a set of kernels multiple (30) times. In each of these 30 tests (they are deterministic; in every test the set of kernels is called 10 times, and this number is fixed), I start by calling cudaSetDevice(0), and then everything gets malloc'd and memcpy'd. When the test is done and the execution time has been taken, everything is cudaFree'd.

Here's a sample output from my program:

avg:  81.7189
times:
213.0105 202.8020 196.8834 202.4001 197.7123 215.4658 199.5302 198.6519 200.8467 203.7865
20.2014 20.1881 21.0537 20.8805 20.1986 20.6036 20.9458 20.9473 20.2929 20.9167
21.0686 20.4563 24.5359 21.1530 21.7075 23.3320 20.5921 20.6506 19.9331 20.8211

The first 10 runs take about 200 ms, while the rest take about 20 ms.

Every kernel apparently calculates the same values; they all print the correct one. But since I malloc in the same order on every test, couldn't the GPU memory still hold the values from the previous execution?

Also, the kernels aren't returning errors, because I'm checking for them. Every kernel launch is followed by cudaThreadSynchronize() for debugging purposes and by error checking with this macro:

#define CUDA_ERROR_CHECK  if( (error = cudaGetLastError()) != cudaSuccess) printf("CUDA error: %s\n", cudaGetErrorString(error));
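
Each launch looks roughly like this (the kernel name and arguments below are just placeholders, not my actual code):

cudaError_t error;
...
mykernel <<< n_blocks, block_size >>> (d_data);
cudaThreadSynchronize();   // wait for the kernel so the check below catches its errors
CUDA_ERROR_CHECK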

Why is this happening?

I'm getting the execution times with these Windows functions:

#include <windows.h>
#include <iostream>
#include <vector>
using namespace std;

double PCFreq = 0.0;       // counter ticks per millisecond
__int64 CounterStart = 0;
vector<double> v;          // measured times, in ms

void StartCounter()
{
    LARGE_INTEGER li;
    if(!QueryPerformanceFrequency(&li))
        cout << "QueryPerformanceFrequency failed!\n";

    PCFreq = double(li.QuadPart)/1000.0;   // convert counts to milliseconds

    QueryPerformanceCounter(&li);
    CounterStart = li.QuadPart;
}

void StopCounter()
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    double time = double(li.QuadPart-CounterStart)/PCFreq;   // elapsed ms
    v.push_back(time);
}
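
They're used roughly like this around each timed test (simplified; the kernel name is a placeholder):

StartCounter();
for(int i = 0; i < 10; i++)
{
    mykernel <<< n_blocks, block_size >>> (d_data);
    cudaThreadSynchronize();
    CUDA_ERROR_CHECK
}
StopCounter();   // pushes the elapsed milliseconds into v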

Edit:

The mallocs, copies, and other setup aren't being timed. I only time the execution (kernel launches and synchronization).

Visual Studio 2010's optimizations are turned on. Everything is set to maximize speed. CUDA's optimizations are on as well.

hfingler

1 Answer


Measuring kernel execution time with QueryPerformanceCounter is wrong, because kernel launches are asynchronous: the host launches the kernel and then both continue in parallel. You are probably measuring only the launch-call time.

To measure kernel execution time, use cudaEvents, as ahmad mentioned:

cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
...
cudaEventRecord(start, 0);
yourkernel <<< n_blocks, block_size >>> (a_d, N);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
... 
cudaEventElapsedTime(&time, start, stop);
printf ("Time for the kernel: %f ms\n", time);

If you want to use QueryPerformanceCounter, you have to call

cudaDeviceSynchronize();

after the kernel launch. It blocks until the kernel has finished.
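
With your StartCounter/StopCounter helpers, a host-side measurement would then look roughly like this (launch configuration and arguments are placeholders):

StartCounter();
yourkernel <<< n_blocks, block_size >>> (a_d, N);
cudaDeviceSynchronize();   // block until the kernel has finished
StopCounter();             // the elapsed time now covers the whole kernel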

Tomasz Dzięcielewski
  • I'm not measuring the launches because after every kernel launch I do a Synchronize. And after all the kernels have been launched, I do another one to make sure all kernels were executed before calculating the total elapsed time. I've already had this question here: http://stackoverflow.com/questions/12048209/trouble-measuring-the-elapsed-time-of-a-cuda-program-and-cuda-kernels – hfingler Nov 16 '12 at 15:46
  • @polar why don't you try using cudaEvent API as responders have already suggested, and see if you get similar or different results? Also, why don't you post a simple example (complete, compilable code) that reproduces the problem you think you have? If you still see varying times with cudaEvent, then gradually move various functions (e.g. cudaMalloc, cudaMemcpy, cudaFree) outside of your timing loop until the variability goes away. If the variability is still there when timing only the kernel, then you need to analyze why the kernel time may be varying. You haven't posted any of this code. – Robert Crovella Nov 16 '12 at 16:28
  • I will try it as soon as I can and post the results here. I'm only timing the kernel execution times, not the malloc, copy and other stuff. – hfingler Nov 16 '12 at 17:05
  • I wrote my master's thesis using CUDA a few months ago, and I suggest you try events. I'm sure this is the best way to measure execution time. Post your results, because I'm curious. – Tomasz Dzięcielewski Nov 16 '12 at 19:06
  • I will. In fact, this is for my master thesis. – hfingler Nov 16 '12 at 21:04
  • Before taking a measurement, run a very simple kernel first, because the CUDA environment has to initialize and the first CUDA call takes a lot of time. The kernel can be as trivial as `a[0] = 1;`, run with only one thread (see the sketch after these comments). – Tomasz Dzięcielewski Nov 16 '12 at 21:13
  • There is a curand setup kernel before each test; these aren't being timed. – hfingler Nov 20 '12 at 18:02
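
A minimal warm-up along the lines of that last suggestion might look like this (the kernel and variable names are placeholders):

__global__ void warmup(int *a)
{
    a[0] = 1;   // trivial work, just to trigger CUDA context initialization
}
...
warmup <<< 1, 1 >>> (d_a);   // one block, one thread
cudaDeviceSynchronize();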