As described in "How to reduce CUDA synchronize latency / delay",
there are two approaches to waiting for a result from the device:
- "Polling": the CPU thread busy-waits in a spin loop, which reduces latency while waiting for the result
- "Blocking": the thread sleeps until an interrupt occurs, which improves overall performance
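To make the two approaches above concrete, here is a rough host-only analogy in plain C++. It is not how the CUDA driver is implemented; `Completion`, `wait_spin`, `wait_blocking` and `run_once` are hypothetical names invented for this sketch, and a short-lived thread stands in for the device.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "the device has finished"; not a CUDA object.
struct Completion {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    // Called by the fake "device" thread when its work is complete.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// "Polling": re-check the flag in a tight loop. The waiting thread keeps a
// CPU core busy the whole time. Returns how many times the flag was polled.
inline long wait_spin(Completion& c) {
    long polls = 0;
    while (!c.done.load(std::memory_order_acquire)) {
        ++polls;
    }
    return polls;
}

// "Blocking": sleep on a synchronization primitive until signalled, so the
// core is free for other threads; waking up goes through the OS scheduler.
inline void wait_blocking(Completion& c) {
    std::unique_lock<std::mutex> lock(c.m);
    c.cv.wait(lock, [&c] { return c.done.load(std::memory_order_acquire); });
}

// Runs one wait style against a fake "device" thread that finishes after
// `delay`. Returns true once the waiter has observed completion.
template <typename WaitFn>
bool run_once(WaitFn wait, std::chrono::milliseconds delay) {
    Completion c;
    std::thread device([&c, delay] {
        std::this_thread::sleep_for(delay);
        c.signal();
    });
    wait(c);
    device.join();
    return c.done.load();
}
```

For example, `run_once([](Completion& c) { wait_blocking(c); }, std::chrono::milliseconds(5))` waits without occupying a core, while passing `wait_spin` instead keeps one core fully busy for the same 5 ms.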
For "Polling", I need to use cudaDeviceScheduleSpin.
But for "Blocking", which one should I use: cudaDeviceScheduleYield or cudaDeviceScheduleBlockingSync?
What is the difference between cudaDeviceScheduleBlockingSync and cudaDeviceScheduleYield?
The documentation for cudaDeviceScheduleYield (http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g18074e885b4d89f5a0fe1beab589e0c8.html) says:
"Instruct CUDA to yield its thread when waiting for results from the device. This can increase latency when waiting for the device, but can increase the performance of CPU threads performing work in parallel with the device."
As I read it, this means waiting for the result without burning the CPU in a spin loop, i.e. "Blocking". But cudaDeviceScheduleBlockingSync also waits for the result without burning the CPU in a spin loop. So what is the difference between the two?