I need to measure the time difference between allocating normal CPU memory with new
and a call to cudaMallocManaged
. We are working with unified memory and are trying to figure out the trade-offs of switching things to cudaMallocManaged
. (The kernels seem to run a lot slower, likely due to a lack of caching or something.)
Anyway, I am not sure of the best way to time these allocations. Would one of Boost's process_real_cpu_clock
, process_user_cpu_clock
, or process_system_cpu_clock
give me the best results? Or should I just use the regular std::chrono clocks from C++11? Or should I use the cudaEvent facilities for timing?
I figure that I shouldn't use the CUDA events, because they are for timing GPU processes and would not be accurate for timing CPU calls (correct me if I am wrong there). If I could use cudaEvents on just the cudaMallocManaged
call, what would be the most accurate thing to compare against when timing the new
call? I just don't know enough about memory allocation and timing. Everything I read seems to just make me more confused, due to Boost's and Nvidia's shoddy documentation.