I'm running a set of kernels 30 times. The tests are deterministic: in every test the set of kernels is called exactly 10 times, and this number is fixed. At the beginning of each test I call cudaSetDevice(0) and everything gets cudaMalloc'd and cudaMemcpy'd; when the test is done and the execution time has been taken, everything is cudaFree'd.
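In outline, each test does something like this (d_data, h_data, and n are placeholder names here, not my actual code):

cudaSetDevice(0);                          // select the GPU at the start of every test
float *d_data;
cudaMalloc(&d_data, n * sizeof(float));    // allocations happen in the same order every test
cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

// ... timed region: the set of kernels is launched 10 times (see below) ...

cudaFree(d_data);                          // everything is freed once the time is taken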
Here's a sample output from my program:
avg: 81.7189
times:
213.0105 202.8020 196.8834 202.4001 197.7123 215.4658 199.5302 198.6519 200.8467 203.7865
20.2014 20.1881 21.0537 20.8805 20.1986 20.6036 20.9458 20.9473 20.2929 20.9167
21.0686 20.4563 24.5359 21.1530 21.7075 23.3320 20.5921 20.6506 19.9331 20.8211
The first 10 times are about 200 ms, while the remaining 20 are about 20 ms.
Apparently every kernel computes the same values; they all print the correct result. But since I malloc in the same order on every test, couldn't the GPU memory still hold the values from the previous execution?
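If leftover data were the issue, I suppose I could rule it out by zeroing every buffer right after allocating it, roughly like this (d_data and n are placeholders again):

cudaMalloc(&d_data, n * sizeof(float));
cudaMemset(d_data, 0, n * sizeof(float));   // wipe the allocation so nothing from a previous run can survive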
Also, the kernels aren't returning errors, because I'm checking for them: every kernel launch is followed by cudaThreadSynchronize() (for debugging purposes) and then an error check with this macro:
#define CUDA_ERROR_CHECK \
    if ((error = cudaGetLastError()) != cudaSuccess) \
        printf("CUDA error: %s\n", cudaGetErrorString(error));
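So a launch in my code looks roughly like this (myKernel, blocks, threads, and d_data are placeholder names):

cudaError_t error;                          // the macro assumes this is in scope

myKernel<<<blocks, threads>>>(d_data, n);
cudaThreadSynchronize();                    // force the launch to finish so the error surfaces here
CUDA_ERROR_CHECK                            // prints the error string if the launch or kernel failed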
Why is this happening?
I'm getting the execution times with these Windows functions:
#include <windows.h>
#include <iostream>
#include <vector>

using std::cout;

double PCFreq = 0.0;          // performance-counter ticks per millisecond
__int64 CounterStart = 0;     // counter value when timing began
std::vector<double> v;        // per-test times in milliseconds

void StartCounter()
{
    LARGE_INTEGER li;
    if (!QueryPerformanceFrequency(&li))
        cout << "QueryPerformanceFrequency failed!\n";
    PCFreq = double(li.QuadPart) / 1000.0;   // scale the frequency so results come out in ms
    QueryPerformanceCounter(&li);
    CounterStart = li.QuadPart;
}

void StopCounter()
{
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    double time = double(li.QuadPart - CounterStart) / PCFreq;   // elapsed ms
    v.push_back(time);
}
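They wrap the timed region roughly like this (same placeholder names as above):

StartCounter();
for (int i = 0; i < 10; ++i)
{
    myKernel<<<blocks, threads>>>(d_data, n);
    cudaThreadSynchronize();                // every launch is synchronized and checked
    CUDA_ERROR_CHECK
}
StopCounter();                              // the elapsed milliseconds land in v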
Edit:
The mallocs, copies, and other setup aren't being timed; I only time the execution itself (kernel launch and sync).
Visual Studio 2010's optimizations are turned on (everything is set to Maximize Speed), and the CUDA compiler's optimizations are on as well.