I have a sparse triangular solver that runs on 4 Tesla V100 GPUs. The implementation is complete and the results are accurate. However, I am currently using a CPU timer to measure elapsed time, and I know a CPU timer is not the best choice for this, since CUDA Events are available.
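Roughly, what I do now is something like the following host-side timing (a simplified sketch; std::chrono here stands in for whatever CPU timer is used):

#include <chrono>

auto t0 = std::chrono::steady_clock::now();
// ... launch kernels on all GPUs and synchronize each device (the loops shown below) ...
auto t1 = std::chrono::steady_clock::now();
// total wall-clock time across all GPUs, in milliseconds
double elapsed_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();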
But the thing is, I do not know how to use CUDA Events in a multi-GPU setup. In the NVIDIA tutorials I have seen, events are used for inter-GPU synchronization, i.e. making one GPU wait for other GPUs to finish their dependencies. Anyway, I define the events like this:
cudaEvent_t start_events[num_gpus];
cudaEvent_t end_events[num_gpus];
I can also initialize these events in a loop, setting the current GPU on each iteration.
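Something along these lines (a minimal sketch; error checking uses the same CUDA_FUNC_CALL macro as in the launch code below):

for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));                     // events belong to the current device
    CUDA_FUNC_CALL(cudaEventCreate(&start_events[i]));
    CUDA_FUNC_CALL(cudaEventCreate(&end_events[i]));
}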
And my kernel execution looks like this:
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    kernel<<<grid, block>>>(/* per-GPU arguments */);  // launch configuration elided
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaDeviceSynchronize());
}
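From what I understand, the pattern would be to record a start event before each launch and an end event right after it on the same device, then query each pair with cudaEventElapsedTime. Here is my rough guess (grid, block, and the kernel arguments are placeholders, and I am not sure this is the right pattern for multi-GPU):

#include <cstdio>  // for printf

for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaEventRecord(start_events[i]));   // recorded on GPU i's default stream
    kernel<<<grid, block>>>(/* per-GPU arguments */);
    CUDA_FUNC_CALL(cudaEventRecord(end_events[i]));     // marks completion of GPU i's kernel
}
for (int i = 0; i < num_gpus; i++)
{
    CUDA_FUNC_CALL(cudaSetDevice(i));
    CUDA_FUNC_CALL(cudaEventSynchronize(end_events[i])); // wait until GPU i's end event has completed
    float ms = 0.0f;
    CUDA_FUNC_CALL(cudaEventElapsedTime(&ms, start_events[i], end_events[i]));
    printf("GPU %d: %.3f ms\n", i, ms);
}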
My question is: how should I use these events to record the elapsed time on each GPU separately? Is the sketch above the correct approach, or is there a better pattern for multi-GPU timing?