I have a function that looks like this:

int doSomething() {
    <C++ host code>
    <CUDA device code>
    <C++ host code>
    <...>
}

I would like to measure the running time of this function with high precision (at least millisecond resolution) on both Linux and Windows.

I know how to measure the running time of a CUDA program with events, and I have found very accurate libraries for measuring the CPU time used by my process, but I want to measure the overall running time. I can't measure the two times separately and add them together, because device code and host code can run in parallel.

I want to use as few external libraries as possible, but I am interested in any good solution.

SqrtPi
  • possible duplicate of [how to measure gpu vs cpu performance , with which time measuring functions?](http://stackoverflow.com/questions/16258141/how-to-measure-gpu-vs-cpu-performance-with-which-time-measuring-functions) – talonmies Apr 30 '13 at 17:16
  • This has been asked many times before, as recently as *two days ago*. Please search or check the recent questions and FAQs for the CUDA tag before asking a question. – talonmies Apr 30 '13 at 17:17
  • Have you tried using the CUDA profiler? I deliberately insert cudaDeviceSynchronize calls in order to measure the CPU timing with the profiler. – TripleS Apr 30 '13 at 17:42
  • @talonmies: I checked those topics, but all of them measure the running time of the CPU and GPU code separately, with different methods, and I want to measure the overall running time of a function containing both pure C++ host code and CUDA device code. Adding the two times won't help me much because of the CPU/GPU concurrency. – SqrtPi Apr 30 '13 at 17:53
  • @TripleS: It is a very good idea, but I prefer to measure it from the C++ source code, because I usually take a lot of measurements with different input parameters. It would be much easier for me if I could store the result from C++ instead of reading it out from a UI. – SqrtPi Apr 30 '13 at 17:56
  • If that's what you want, the best thing to do is to use CUDA counters. I'd highly recommend building or finding a CUDA stopwatch class. I used to have one; let me know if you want me to post it. – TripleS May 01 '13 at 11:22
  • Thank you for your help, but I think I will use the solution suggested by Robert Crovella. I have found a similar method for Windows-based systems and wrote two small functions (start timer and stop timer) with compile-time directives to distinguish between platforms. – SqrtPi May 03 '13 at 07:40

2 Answers

According to the sequence you have shown, I would recommend you do the following:

int doSomething() {
  <C++ host code>
  <CUDA device code>
  <C++ host code>
  <...>
  cudaDeviceSynchronize();  // add this
}

and:

<use your preferred CPU high precision measurement start function>
doSomething();
<use your preferred CPU high precision measurement stop function>

The added cudaDeviceSynchronize() call is not necessary if you have some prior implicit synchronization, such as a cudaMemcpy() call after the last kernel in the <CUDA device code> section.
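As a minimal sketch of the start/stop wrapper above, assuming a C++11 compiler is available, the standard std::chrono::steady_clock can serve as the wall-clock timer on both Linux and Windows without any external library. The helper name elapsed_ms is hypothetical and not from the original post; the callable passed to it stands in for doSomething(), which is assumed to finish with cudaDeviceSynchronize() or another synchronizing call.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (illustrative name): runs `work` once and returns the
// elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& work) {
    // steady_clock is monotonic, so system clock adjustments made while
    // `work` is running do not distort the measurement.
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();  // e.g. doSomething(), which must synchronize before returning
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as elapsed_ms([] { doSomething(); }) would then return the overall time. Because this is real elapsed time rather than CPU time, host and device work that overlap are counted only once.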

Responding to a question in the comments below: @JackOLantern seems to be suggesting a high-precision CPU timing method with start (tic) and stop (toc) points in the answer here, which talonmies also pointed out. If you don't like using the results returned by CLOCK_MONOTONIC, you might also try specifying CLOCK_REALTIME_HR instead, if your system defines it (plain CLOCK_REALTIME is the standard POSIX wall-clock ID). On a Linux box, run man clock_gettime for more info.
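For illustration, a tic/toc pair built on clock_gettime might look like the sketch below. This is not the code from the linked answer; the function names and the global g_tic are assumptions made for this example, and it is Linux/POSIX only.

```cpp
#include <time.h>  // POSIX clock_gettime and CLOCK_MONOTONIC

// Hypothetical tic/toc pair (illustrative names). Holds the last start point.
static timespec g_tic;

// Records the start point.
void tic() { clock_gettime(CLOCK_MONOTONIC, &g_tic); }

// Returns the wall-clock milliseconds elapsed since the last tic().
double toc() {
    timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - g_tic.tv_sec) * 1e3 +
           (now.tv_nsec - g_tic.tv_nsec) / 1e6;
}
```

Calling tic() before doSomething() and toc() after it gives elapsed real time, not the CPU time consumed by the process. On older glibc versions (before 2.17) you may need to link with -lrt.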

Robert Crovella
  • I am looking for something like this, but all the high-precision timers I know measure the CPU time spent in the process, not the real time between the start and the end call. I would appreciate it if you could suggest a high-precision timer that measures the real execution time instead of the CPU time. – SqrtPi Apr 30 '13 at 19:42
  • Edited my answer to respond to this question. – Robert Crovella Apr 30 '13 at 19:59

For Windows:

#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER perfCntStart, perfCntStop, proc_freq;
    // Counter ticks per second; fixed at boot, so one query is enough.
    ::QueryPerformanceFrequency( &proc_freq );
    ::QueryPerformanceCounter( &perfCntStart );

    // ... do something (synchronize with the GPU before stopping the timer)

    ::QueryPerformanceCounter( &perfCntStop );
    // Divide in double: float cannot represent large tick counts exactly.
    printf( "Elapsed: %f s\n", double( perfCntStop.QuadPart - perfCntStart.QuadPart ) / double( proc_freq.QuadPart ) );
    return 0;
}
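The asker mentions in the comments ending up with two small start/stop functions selected by compile-time directives. That code was not posted, so the following is only a guess at the general shape: it uses QueryPerformanceCounter on Windows and clock_gettime elsewhere, and the names TimePoint, startTimer and stopTimerMs are hypothetical.

```cpp
#ifdef _WIN32
#include <windows.h>
typedef LARGE_INTEGER TimePoint;  // raw performance-counter ticks
#else
#include <time.h>
typedef timespec TimePoint;       // seconds plus nanoseconds
#endif

// Captures the current wall-clock reading (hypothetical name).
TimePoint startTimer() {
    TimePoint t;
#ifdef _WIN32
    QueryPerformanceCounter(&t);
#else
    clock_gettime(CLOCK_MONOTONIC, &t);
#endif
    return t;
}

// Returns the milliseconds elapsed since `start` (hypothetical name).
double stopTimerMs(const TimePoint& start) {
    const TimePoint now = startTimer();
#ifdef _WIN32
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);  // counter ticks per second
    return 1e3 * double(now.QuadPart - start.QuadPart) / double(freq.QuadPart);
#else
    return (now.tv_sec - start.tv_sec) * 1e3 +
           (now.tv_nsec - start.tv_nsec) / 1e6;
#endif
}
```

Typical use would be TimePoint t = startTimer(); doSomething(); double ms = stopTimerMs(t); with doSomething() synchronizing with the device before it returns.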