
I am trying to do simple unidirectional communication between a CPU and a K80 GPU using CUDA. I want a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and should be settable to true by a CPU/host thread during ongoing computation; the GPU/kernel threads should then exit.

This is what I tried; the code is simplified. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).

On the host side, global definition (.cpp):

// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr; 

On the host side in the compute thread (.cpp):

initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);

On the host side in a main thread (.cpp):

cancel(cancelRequested); // Called after init is finished

Host routines (.cu file):

void initialize(volatile bool** pCancelRequested)
{
   cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
   const bool aFalse = false;
   cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested) 
{
   ....
   computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
   cudaDeviceSynchronize(); // Non-busy wait
   ....
}

void finalize(volatile bool** pCancelRequested)
{
   cudaFree(*const_cast<bool**>(pCancelRequested));
   *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
   const bool aTrue = true;
   cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}

Device routines (.cu file):

__global__ void computeKernel(volatile bool* pCancelRequested)
{
   while (someCondition)
   {
      // Computation step here

      if (*pCancelRequested)
      {
         printf("-> Cancel requested!\n");
         return;
      }
   }
}

The code runs fine, but it never enters the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb, i.e. writing to the global flag works fine, at least from the host's point of view. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.

Any idea why this doesn't work?

Silicomancer
  • `volatile` makes zero guarantees regarding atomicity or visibility or access ordering between threads - it's **not** suitable for multithread synchronization. See [**Does "volatile" guarantee anything at all in portable C code for multi-core systems?**](https://stackoverflow.com/questions/58695320/does-volatile-guarantee-anything-at-all-in-portable-c-code-for-multi-core-syst), or, more directly to the point [**Is `volatile` Useful with Threads**](http://isvolatileusefulwiththreads.in/cplusplus/) – Andrew Henle Feb 08 '23 at 12:45
  • Also, read this: https://web.archive.org/web/20120210232850/http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming/ and this https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2016.html if you're not convinced `volatile` is "almost useless for multithreaded programming". – Andrew Henle Feb 08 '23 at 12:58
  • @AndrewHenle I'm aware of this. I do NOT use volatile for any multi-threading purpose. (I'm protecting the critical pointer by application logic and sequential access.) I'm using volatile only for its original purpose: telling the nvcc compiler that the GPU's global memory can change at any time since it is accessed by separate hardware, namely the CPU, and that it should not rely on cached values but do a costly global memory access instead. This is also what is recommended in the CUDA documentation – Silicomancer Feb 08 '23 at 13:09
  • Amongst possibly other problems, `cudaMemcpy` cannot run concurrently with your kernel as you have it currently written. This is due to CUDA streams. There are a number of concepts needed to make this kind of signalling work, [here](https://stackoverflow.com/questions/20345702/how-can-i-check-the-progress-of-matrix-multiplication/20381924#20381924) is an example that covers most, although it's not the only way. And yes, I understand it represents signalling in the other direction, I'm not suggesting it's a complete solution. And if you are trying to do this in Windows WDDM it's extra difficult – Robert Crovella Feb 08 '23 at 15:16
  • @RobertCrovella I'm using Linux. Could you go into a bit more detail? The example doesn't have many explanations or comments. When looking at that code I'm not sure what the crucial point is. It uses memory polling similar to my code. I see 3 differences: 1) use of mapped pinned memory 2) __threadfence_system() calls 3) atomicAdd() access. – Silicomancer Feb 08 '23 at 21:19
  • 2) seems to apply in the GPU->CPU direction only (since __threadfence_system() is device-only) and 3) seems not to apply since I need a simple write on one side and a simple read on the other side, not an RMW operation. Does this mean mapped pinned memory is the key? – Silicomancer Feb 08 '23 at 21:20
  • Your question is why this doesn't work. Do you know anything about CUDA streams? It matters. Yes, mapped pinned memory is one way to work around the limitation that CUDA streams place on the way you are trying to communicate. – Robert Crovella Feb 08 '23 at 21:59
  • Ok, thanks. I once looked into streams but I never used them (i.e. I always used the default stream) since I didn't need them for my project. I know even less about mapped pinned memory. I will read more. – Silicomancer Feb 08 '23 at 22:25

1 Answer


The fundamental problem I see with your approach is that CUDA streams will prevent it from working.

CUDA streams have two basic principles:

  1. Items issued into the same stream will not overlap; they will serialize.
  2. Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.

Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.

I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series.

Because of CUDA streams, this:

 computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);

and this:

 cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);

cannot possibly overlap: they are issued into the same "default" CUDA stream, so rule 1 above applies. But overlap is essential if you want to "signal" the running kernel; we must allow the cudaMemcpy operation to take place at the same time that the kernel is running.

We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:

$ cat t2184.cu
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){
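  // poll the flag in device global memory until the host clears it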

  while (*flag != 0);
}

int main(){

  int *flag, *h_flag = new int;
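  // two created (non-default) streams: one for the kernel, one for the signalling copy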
  cudaStream_t s[2];
  cudaStreamCreate(s+0);
  cudaStreamCreate(s+1);
  cudaMalloc(&flag, sizeof(h_flag[0]));
  *h_flag = 1;
  cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
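  // launch the polling kernel into created stream s[0]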
  k<<<32, 256, 0, s[0]>>>(flag);
  sleep(5);
  *h_flag = 0;
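  // issue the signalling copy into the other stream (s[1]) so it can overlap the running kernel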
  cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
  cudaDeviceSynchronize();
}

$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$

NOTES:

  • Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out a line such as *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
  • Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access; the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location.

This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:

#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){
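  // same polling kernel; here the flag lives in pinned host memory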

  while (*flag != 0);
}

int main(){

  int *flag;
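  // pinned host memory: the same pointer is usable from both host and device code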
  cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
  *flag = 1;
  k<<<32, 256>>>(flag);
  sleep(5);
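  // the host writes the pinned flag directly; no cudaMemcpy (and hence no stream) is needed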
  *flag = 0;
  cudaDeviceSynchronize();
}

For other examples of signalling, such as from device to host, other readers may be interested in this.
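
To give a rough idea of what that looks like, here is a minimal sketch of device->host signalling (not the code from the linked answer): a hypothetical worker kernel publishes a progress counter into pinned host memory and flushes it with __threadfence_system(), while the host polls that counter as the kernel runs. It assumes a UVA (64-bit) system so the pinned host pointer can be passed directly to the kernel, and it uses a single designated writer thread rather than atomics; the kernel name and step count are made up for illustration.

#include <cstdio>
#include <unistd.h>

__global__ void worker(volatile int *progress, int steps){
  for (int i = 0; i < steps; i++){
    // ... one unit of work per iteration ...
    if (threadIdx.x == 0 && blockIdx.x == 0){
      *progress = i + 1;       // publish progress to pinned host memory
      __threadfence_system();  // make the write visible to the host
    }
  }
}

int main(){
  volatile int *progress;
  cudaHostAlloc((void **)&progress, sizeof(int), cudaHostAllocDefault);
  *progress = 0;
  const int steps = 100;
  worker<<<1, 32>>>(progress, steps);
  while (*progress < steps)   // host polls the pinned counter while the kernel runs
    usleep(10000);
  cudaDeviceSynchronize();
  printf("kernel reported %d/%d steps\n", *progress, steps);
  cudaFreeHost((void *)progress);
}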

Robert Crovella
  • I was just reading and found this overview: https://medium.com/analytics-vidhya/cuda-memory-model-823f02cef0bf From your post I understood that pinned host memory is visible to the device as well and that the pointer can be passed directly to and accessed by the kernel. However, in the above overview the author uses an explicit cudaMemcpy() for pinned memory where I would have expected a pointer parameter and a plain access. Does this make sense to you? – Silicomancer Feb 09 '23 at 12:11
  • Yes, it makes sense to me. Pinned memory can be used in multiple ways; there are other important uses for it besides what I have shown here. Perhaps you should study the online training session I linked? I'm not going to respond to questions here that are already covered in that online training session. – Robert Crovella Feb 09 '23 at 14:53