
Background: I am benchmarking/comparing my algorithm across GPGPU platforms.

Problem: Device synchronization when dispatching a DirectX 11 Compute Shader.

I am looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.

The CUDA and OpenCL functions are clearer about their blocking/non-blocking behaviour. DirectCompute, however, is tied to the graphics pipeline (which I am learning and am very unfamiliar with), and therefore I have trouble finding out whether a Dispatch call is blocking, or whether previously issued memory allocations/transfers have finished.
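For reference, this is the blocking pattern available to me on the other two platforms (a sketch only; my_kernel, queue, global and local are placeholder names, not from my actual code):

// CUDA: cudaDeviceSynchronize() blocks until all preceding device work is done.
startTimer();
my_kernel<<<number_of_groups, 1>>>(/* args */);
cudaDeviceSynchronize();
times[i] = stopTimer();

// OpenCL: clFinish() blocks until all commands on the queue have completed.
startTimer();
clEnqueueNDRangeKernel(queue, my_kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(queue);
times[i] = stopTimer();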

Code DX_1:

// Setup
...
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
...

Code DX_2:

for (...) {
    // Setup
    ...
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
    ...
}

Results (average times for 2^2 to 2^11 elements):

DX_1  DX_2   CUDA
1.6   205.5  24.8
1.8   133.4  24.8
29.1  186.5  25.6
18.6  175.0  25.6
11.4  187.5  26.6
85.2  127.7  26.3
166.4 151.1  28.1
98.2  149.5  35.2
26.8  203.5  31.6 

Note: these times were measured on a desktop GPU with a screen connected, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.

Note 2: these are very short sequences (4-2048 elements); the interesting tests will be performed on problem sizes of up to 2^26 elements.

thorbear

2 Answers


My new solution is to avoid synchronizing with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.

CUDA Event  Host QPC
4.6         30.0
4.8         30.0
5.0         31.0
5.2         32.0
5.6         34.0
6.1         34.0
6.9         31.0
8.3         47.0
9.2         34.0
12.0        39.0
16.7        46.0
20.5        55.0
32.1        69.0
48.5        111.0
86.0        134.0
182.4       237.0
419.0       473.0

In case my question brings someone here hoping to find out how to do GPGPU benchmarking, I will leave some code behind demonstrating my current benchmarking strategy.

Code Examples, CUDA

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;   
cudaEventRecord(start);
... 
// Launch my algorithm
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);        
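
The events should also be released once the measurements are done (cleanup is not part of the snippet above):

cudaEventDestroy(start);
cudaEventDestroy(stop);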

OpenCL

cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
... 
// Launch my algorithm
...
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);    
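
Note that clGetEventProfilingInfo only returns valid data if the command queue was created with profiling enabled. Queue creation is not shown above, but it would look along these lines (ctx, dev and err are placeholder names):

cl_command_queue queue = clCreateCommandQueue(ctx, dev, CL_QUEUE_PROFILING_ENABLE, &err);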

DirectCompute

Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:

ID3D11Device*               device = nullptr;
...
// Setup
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
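// ID3D11QueryPtr is assumed to be a COM smart-pointer typedef for ID3D11Query
// (e.g. declared via _COM_SMARTPTR_TYPEDEF, as in the article linked below).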
...
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
... 
// Launch my algorithm
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
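// Spin until each result is available; GetData returns S_FALSE while the GPU
// has not yet produced the query data.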
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
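
One caveat not handled in the snippet above: the timestamps are only reliable if the Disjoint member of the query data is FALSE, so disjoint samples should be thrown away:

if (q_freq.Disjoint)
{
    // The GPU clock changed (e.g. due to power management) during the
    // interval; discard this sample and measure again.
}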

C/C++/OpenMP

static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;

static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}

static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}
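
Note that stopTimer() returns microseconds, while the device-side snippets above produce milliseconds, so the units have to be reconciled when comparing. Usage is simply:

startTimer();
// ... run the CPU/OpenMP implementation ...
double elapsed_us = stopTimer();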

My code examples are taken out of context, and although I tried to do some clean-up, errors might be present.

thorbear
  • Might just set this as answer now but will await my solution for OpenGL and any possible suggestions. – thorbear Oct 29 '15 at 07:46

If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.

This is probably a good primer / example on how to do it:

https://mynameismjp.wordpress.com/2011/10/13/profiling-in-dx11-with-queries/

Adam Miles
  • Thanks! I've put together code that uses DirectX 11's Timestamp queries and it seems to work fine. I've come to realise that my main issue will be fair comparisons. So far I've used the Windows QueryPerformanceCounter (QPC) API on the host, but that includes a lot of overhead that is not present when using device timestamps. – thorbear Oct 23 '15 at 06:09
  • As you alluded to in your original question, timing how long it takes to actually issue the API calls themselves isn't really what you're interested in. Unless you can be sure the GPU was idle when you called Dispatch (possible), started executing the work immediately (likely not), and that the work has finished before you stop your timer (it won't), using QPC isn't going to give you the information you want. – Adam Miles Oct 23 '15 at 10:52