88

When is calling the cudaDeviceSynchronize function really needed?

As far as I understand from the CUDA documentation, CUDA kernel launches are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except for one before the time measurement. I have found that I get the same result, but with a speed-up of 7-12x (depending on the matrix sizes).

So, the question is whether there are any reasons to use cudaDeviceSynchronize apart from time measurement.

For example:

  • Is it needed before copying data from the GPU back to the host with cudaMemcpy?

  • If I do matrix multiplications like

    C = A * B
    D = C * F
    

should I put a cudaDeviceSynchronize between them?

From my experiment it seems that I don't need to.

Why does cudaDeviceSynchronize slow the program down so much?
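
To make the timing setup concrete, here is a minimal sketch of the pattern I mean (the kernel, the sizes, and the missing initialization are placeholders, not my actual training code):

// Minimal sketch of the timing experiment: two dependent kernels launched
// back to back, with a single cudaDeviceSynchronize before stopping the timer.
// matmulKernel and the sizes are placeholders; device memory is left uninitialized.
#include <cstdio>
#include <chrono>

__global__ void matmulKernel(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

int main() {
    const int n = 1024;
    float *A, *B, *C, *D, *F;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));
    cudaMalloc(&D, n * n * sizeof(float));
    cudaMalloc(&F, n * n * sizeof(float));

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    auto t0 = std::chrono::high_resolution_clock::now();

    matmulKernel<<<grid, block>>>(A, B, C, n); // C = A * B, no sync here
    matmulKernel<<<grid, block>>>(C, F, D, n); // D = C * F, no sync here

    cudaDeviceSynchronize(); // the only sync, right before stopping the timer
    auto t1 = std::chrono::high_resolution_clock::now();
    printf("elapsed: %f ms\n", std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}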

user1588226
  • One instance would be if you have any printf statements in the kernel: the output buffer won't be flushed until a synchronization event. – Daniel B. Jul 08 '16 at 15:35

4 Answers

80

Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.

So, for example,

kernel1<<<X,Y>>>(...); // kernel starts executing, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel2 is placed in the queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes

So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).

cudaDeviceSynchronize may cause some slowdown, but 7-12x seems like too much. Either there is some problem with the time measurement, or the kernels are really fast and the overhead of explicit synchronization is huge relative to the actual computation time.
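
For the debugging use mentioned above, a minimal sketch might look like this (the CHECK macro and the kernels are only illustrative, not part of the CUDA API):

// Sketch: using cudaDeviceSynchronize() to pinpoint which kernel failed.
// CHECK and the kernel bodies are illustrative placeholders.
#include <cstdio>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess)                                          \
            printf("CUDA error %s at %s:%d\n",                           \
                   cudaGetErrorString(err), __FILE__, __LINE__);         \
    } while (0)

__global__ void kernel1(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernel2(float *x) { x[threadIdx.x] += 1.0f; }

int main() {
    float *d_x;
    CHECK(cudaMalloc(&d_x, 32 * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, 32 * sizeof(float)));

    kernel1<<<1, 32>>>(d_x);
    CHECK(cudaGetLastError());      // catches launch-configuration errors of kernel1
    CHECK(cudaDeviceSynchronize()); // catches errors that occur while kernel1 runs

    kernel2<<<1, 32>>>(d_x);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize()); // without this, an error inside kernel2 would only
                                    // surface at the next synchronizing call

    CHECK(cudaFree(d_x));
    return 0;
}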

aland
  • The "single default GPU stream unless otherwise specified" is not always held to by nvcc. I just debugged a program where I broke down a lengthy calculation on one kernel into a piecewise calculation that launched kernels one at a time in a for() loop. Successive for() loop kernel launches pick up where the previous for() loop kernel left off device-side. The bug was that the nvcc compiler could not see this from just the host code and tried to launch every kernel at the same time. This meant that all the kernels but the first kernel were computing garbage. – opetrenko Jul 09 '14 at 01:47
  • @AleksandrDubinsky Please read my comment more carefully. I very explicitly put down "is not always held to by nvcc". I then gave an example of a specific bug I chased down using cuda-gdb that serves as an example proving precisely that. I would definitely agree that based on Nvidia's literature this is not how CUDA is supposed to work... but what I was stating was not an opinion: it was an observation made during debugging about how it worked in a specific instance. – opetrenko Jan 23 '15 at 19:47
  • @opetrenko Sorry for not believing you, but you either discovered an extremely critical bug, or something else was broken in your code. Did you post the problem on SO? – Aleksandr Dubinsky Jan 24 '15 at 08:35
  • @AleksandrDubinsky No problem. In regards to your question I am registered with Nvidia via simulation work I do for physics on a cluster they have donated hardware to. When I see serious issues I post on their CUDA forum where the people who design the nvcc compiler are in a place to fix it. – opetrenko Feb 04 '15 at 18:45
  • @AleksandrDubinsky That said, I would not call the above a critical bug with nvcc having worked with it the past year (I have found far worse). It would be a major bug in gcc but memory/instruction-level parallel work is tricky... the compiler has to be a little overzealous to get speedups. – opetrenko Feb 04 '15 at 19:01
  • @opetrenko Actually, it is the runtime not the compiler that orders kernels. It does not try to be over-zealous. Simply, kernels to the same stream are launched in-order. Perhaps you mean that gcc re-ordered the library calls that launch kernels? Do you have a link to the forum discussion of your problem? – Aleksandr Dubinsky Feb 05 '15 at 08:56
  • @AleksandrDubinsky I am not going to get into a "it was the program that did it not the compiler" because when all is said and done the nvcc compiler created code that did not enforce a strict order in a host function. The code in question was a host function (return type void) in a .cu file which was by definition compiled and linked with nvcc. gcc was not involved, period. I said I considered this a non-serious bug. Also, I have already said I do not post non-serious bugs on the Nvidia forum. I do not know why you are requesting a link to a discussion that does not exist. – opetrenko Feb 06 '15 at 03:35
  • @opetrenko NVCC is not a compiler of host code. It is a pre-processor that hands off to the system's compiler (gcc). You are misinformed on many aspects of CUDA. Rather than posting misleading information on SO and hurting others, you should post questions about things you don't know or the problems you run into. Perhaps someone could have clarified why your code appeared to become fixed by the addition of superfluous `cudaDeviceSynchronize` calls. – Aleksandr Dubinsky Feb 06 '15 at 17:52
  • @AleksandrDubinsky Your attitude is uncalled for. Despite that, I can grant you are correct in correcting me on gcc (for linux) being involved after the .cu file is preprocessed by nvcc into separate compilation trees. That is no excuse for generalizing about someone's knowledge or talking down to them while in the act of addressing the legitimate point (.cu going first to nvcc by definition) by giving a specific argument (i.e. nvcc only taking the step of being a pre-processor for host code then handing it off to the host compiler). – opetrenko Feb 10 '15 at 03:46
  • @opetrenko Please accept my attitude as the result of trying to politely ask you not to publicly make improbable claims (which could severely confuse a novice who visits this page) such as, "CUDA ignores its specification and launches kernels out of order" without having sufficiently investigated the more likely scenario that the bug is in your own code, and being met with obstinacy. – Aleksandr Dubinsky Feb 11 '15 at 16:55
  • @AleksandrDubinsky Fair enough. To keep things on a productive note, I would say my attitude with NVCC after studying the documentation and building simulations is this: To assume the macro "might" take liberties about what should be the host code in a .cu file (before handing it to the host compiler). At times I understand making programs that are contained in a single .cu/.cuh file pair and that nvcc should, in theory, not cause problems for host code. But I found in practice using wrappers made a LOT of issues go away while simultaneously making things more modular (where appropriate). – opetrenko Feb 14 '15 at 23:50
  • @AleksandrDubinsky Just thought I would add a few more relevant findings. I have had a significant number of bugs resolve when moving host code to non-CUDA header and source files. NVCC will still link them but there is no labeling of host code using the nvcc preprocessor flags in .cu or .cuh files. I would conclude that the original point stands. To restate that point: Just because NVCC theoretically hands off host functions in .cu/.cuh to GCC does NOT mean that it works out that way in practice. – opetrenko Jun 09 '15 at 22:32
  • @opetrenko I just experienced the same problem, launching kernels one at a time in a for loop. The exact same code runs fine on Linux but fails on Windows. I cannot believe I need a `cudaDeviceSynchronize` to solve it. – user3667089 Feb 08 '17 at 23:15
  • @user3667089 and anyone else. The far, far more likely scenario is that opetrenko's and user3667089's problems are due to code mistakes, not errors in NVCC and the CUDA API. Lacking working example code that reproduces the issue, this debate reduces to opinion. – Tyson Hilmer Aug 30 '17 at 07:34
19

One situation where using cudaDeviceSynchronize() is appropriate is when you have several CUDA streams running and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information.

The reason using this command slows the program down so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else.

This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the stream(s)' kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
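
A minimal sketch of that multi-stream pattern (the kernel and the host-side exchange step are placeholders for the real parallel-tempering code):

// Sketch: several streams work independently; cudaDeviceSynchronize() makes sure
// all of them are done before the host lets the replicas exchange information.
// stepKernel and the exchange step are placeholders.
#include <cstdio>

__global__ void stepKernel(float *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] += 1.0f; // stand-in for one simulation step
}

int main() {
    const int n = 1 << 20;
    const int nStreams = 2;
    float *state[nStreams];
    cudaStream_t streams[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&state[s], n * sizeof(float));
        cudaMemset(state[s], 0, n * sizeof(float));
        cudaStreamCreate(&streams[s]);
    }

    // Each stream advances its own replica asynchronously.
    for (int s = 0; s < nStreams; ++s)
        stepKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(state[s], n);

    // Wait for *all* streams before the replicas exchange information;
    // otherwise the exchange could read half-finished results.
    cudaDeviceSynchronize();

    // ... host-side exchange / swap of replica data would go here ...

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(state[s]);
    }
    return 0;
}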

limes
13

When you want your GPU to start processing some data, you typically do a kernel invocation. When you do so, your device (the GPU) will start doing whatever it is you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.

In small, simple programs you would typically use cudaDeviceSynchronize when you use the GPU for computations, to avoid timing mismatches between the CPU requesting the result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle all the time while the GPU does the computation. Therefore, in high-performance computing you often strive to have your CPU doing useful work while it waits for the GPU to finish.
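
A rough sketch of that overlap (gpuKernel and doCpuWork are placeholders):

// Sketch: the kernel launch returns immediately, the CPU does useful work in the
// meantime, and cudaDeviceSynchronize() is called only when the GPU result is needed.
#include <cstdio>

__global__ void gpuKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

static double doCpuWork() {
    double acc = 0.0; // stand-in for unrelated host-side computation
    for (int i = 0; i < 1000000; ++i) acc += 1.0 / (i + 1);
    return acc;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    gpuKernel<<<(n + 255) / 256, 256>>>(d_x, n); // returns immediately to the host

    double cpuResult = doCpuWork(); // the CPU is busy while the GPU runs the kernel

    cudaDeviceSynchronize(); // now wait for the GPU before using its output
    printf("cpu result while gpu was busy: %f\n", cpuResult);

    cudaFree(d_x);
    return 0;
}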

Orpedo
2

You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).

From this post CUDA Dynamic Parallelism API and Principles:

If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.

...

Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.

__device__ int v = 0;

__global__ void child_k(void) {
  printf("v = %d\n", v);
}

__global__ void parent_k(void) {
  v = 1;
  child_k<<<1, 1>>>();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}
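
A sketch of the race-free version of the same example: the parent finishes all writes the child depends on before the launch and synchronizes before touching that memory again. (Note that device-side cudaDeviceSynchronize() has since been deprecated in recent CUDA toolkits, so this applies to the older dynamic-parallelism model the post describes; compile with relocatable device code, e.g. nvcc -rdc=true.)

#include <cstdio>

__device__ int v = 0;

__global__ void child_k(void) {
  printf("v = %d\n", v);   // now guaranteed to print 1
}

__global__ void parent_k(void) {
  v = 1;                   // every write the child depends on happens before the launch
  child_k<<<1, 1>>>();
  cudaDeviceSynchronize(); // wait for the child grid to finish
  v = 2;                   // safe: the child has already read v
}

int main() {
  parent_k<<<1, 1>>>();
  cudaDeviceSynchronize(); // host-side sync so the device printf buffer is flushed
  return 0;
}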
dontloo