
I am new to CUDA and got a little confused with cudaEvent. I now have a code sample that goes as follows:

float elapsedTime; 
cudaEvent_t start, stop;
CUDA_ERR_CHECK(cudaEventCreate(&start));
CUDA_ERR_CHECK(cudaEventCreate(&stop));

CUDA_ERR_CHECK(cudaEventRecord(start));

// Kernel functions go here ...

CUDA_ERR_CHECK(cudaEventRecord(stop));
CUDA_ERR_CHECK(cudaEventSynchronize(stop));
CUDA_ERR_CHECK(cudaEventElapsedTime(&elapsedTime, start, stop));

CUDA_ERR_CHECK(cudaDeviceSynchronize());

I have two questions regarding this code:

1. Is the last cudaDeviceSynchronize necessary? According to the documentation, cudaEventSynchronize will "wait until the completion of all device work preceding the most recent call to cudaEventRecord()". So given that we have already called cudaEventSynchronize(stop), do we need to call cudaDeviceSynchronize once again?

2. How does the above code differ from the following implementation:

#include <chrono>

auto tic = std::chrono::system_clock::now();

// Kernel functions go here ...

CUDA_ERR_CHECK(cudaDeviceSynchronize());
auto toc = std::chrono::system_clock::now();

float elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic).count() * 1.0;
talonmies
Bojian Zheng
    1. it's not necessary 2. one is using cudaEvent based timing, the other is using CPU based timing. The CPU based timing may include overheads and latencies that are not in the cudaEvent based timing, but these differences are often small or insignificant. – Robert Crovella Nov 07 '17 at 22:55
  • @RobertCrovella Thanks for your answer. – Bojian Zheng Nov 07 '17 at 23:01
  • What exactly is `cudaEventSynchronize` doing? The docs say "wait for an event (stop) to complete". I can see that the kernels execute in parallel, but does each of them have a "stop"? Or does it wait for all parallel kernel executions to complete before measuring "stop"? – KansaiRobot Jun 15 '21 at 01:45

1 Answer


Just to flesh out comments so that this question has an answer and will fall off the unanswered queue:

  1. No, the cudaDeviceSynchronize() call is not necessary. In fact, in many cases where asynchronous API calls are being issued in multiple streams, it is incorrect to use a global-scope synchronization call, because it defeats the per-stream event timing that allows operations in individual streams to be measured accurately.
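To illustrate the stream-scoped approach, here is a minimal sketch (the trivial `dummyKernel` and the stream name `s` are stand-ins for real work; error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() { }

int main() {
    cudaStream_t s;
    cudaEvent_t start, stop;
    cudaStreamCreate(&s);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record both events into the same stream as the kernel, so the
    // measured interval covers only the work queued in that stream.
    cudaEventRecord(start, s);
    dummyKernel<<<1, 1, 0, s>>>();
    cudaEventRecord(stop, s);

    // Block only until the stop event completes -- no global
    // cudaDeviceSynchronize() needed, and other streams keep running.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    return 0;
}
```

Because both events are recorded into stream `s`, cudaEventSynchronize(stop) waits only for that stream's work, which is what makes event timing usable alongside concurrent work in other streams.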

  2. They are completely different. One uses host-side timing, the other uses device-side timing via the CUDA driver. In the simplest cases, the times measured by both will be comparable. However, if the host-timed section includes a host CPU operation that consumes a significant amount of time, the host measurement will no longer reflect the GPU time used whenever the GPU operations take less time than the host operations.

talonmies