5

I have a Monte Carlo simulation in which the state of the system is a bit string (size N) with the bits being randomly flipped. In an effort to accelerate the simulation, the code was revised to use CUDA. However, because of the large number of statistics I need to calculate from the system state (it grows as N^2), this part needs to be done on the CPU, where there is more memory. Currently the algorithm looks like this:

loop
  CUDA kernel making 10s of Monte Carlo steps
  Copy system state back to CPU
  Calculate statistics

This is inefficient, and I would like the kernel to run persistently while the CPU occasionally queries the state of the system and calculates the statistics as the kernel continues to run.

Based on Tom's answer to this question I think the answer is double buffering, but I haven't been able to find an explanation or example of how to do this.

How does one set up the double buffering described in the third paragraph of Tom's answer for a CUDA/C++ code?

  • 3
    your question is very broad; how much GPU / CPU memory are you using? how long does copying take from GPU to CPU? did you benchmark it? what kind of "large number of statistics" has to be calculated by the CPU? explain your problem in more detail, ideally provide a [MCVE] – m.s. Oct 15 '15 at 13:45
  • 4
    As @m.s says, this seems to be a very broad question. From my perspective, the only reasonable answers would either be a complete code set or a description of how to write that complete code set, which would have to cover a lot of topics. I don't see any need for double-buffering in the basic outline you have shown. Effectively, double-buffering of a sort will occur in the device-to-host copy. This [question/answer](http://stackoverflow.com/questions/20345702/how-can-i-check-the-progress-of-matrix-multiplication) might give you some clues. – Robert Crovella Oct 15 '15 at 14:20
  • 2
    As *m.s.* and *Robert Crovella* said, the question is unanswerable in this form. In my view, any allocation of device memory is already double buffered by its host copy. More buffering will probably lead to increased memory consumption and bandwidth usage. Even though I don't understand the entire problem you are trying to solve, my gut feeling tells me that you can use some [CUDA Streams](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#streams). They will allow you to run GPU code asynchronously. Note that this often leads to a significant increase in code complexity and concurrency bugs – Ivan Aksamentov - Drop Oct 15 '15 at 15:37
  • 2
    I don't think any of you looked at Tom's answer on the link. I want to know how to do what he is describing in the third paragraph of his answer. If the question is broad it is because his answer is vague. People mention double buffering like it is some standard procedure, but no one can give an example of how to do it. – Danielle Eaton Hart Oct 15 '15 at 16:32
  • @m.s. The state of the system is going to be O(100) ints per thread, but in order to make the MC steps the GPU has to have another O(10^6) floats. In addition to this the CPU has another O(10^6) doubles it is calculating from the state (the "large number of statistics"). I haven't benchmarked the copying time because that is not what I am concerned with. I don't want to stop the kernel every 25 steps to take a sample and wait for the CPU to do its calculations and then start the kernel again. I want the CPU and GPU to run at the same time, but need data from the GPU occasionally. – Danielle Eaton Hart Oct 15 '15 at 16:38
  • @Robert Crovella My understanding of double buffering is that there are two copies of the data, one that the GPU is writing to and one that the CPU is reading. Which buffer is which switches occasionally in a way that doesn't corrupt the data. Currently the kernel stops after a very small amount of computation to pass data to the CPU and for the CPU to use it, and then the kernel starts again. I want the CPU to grab data from the kernel while it continues to run. – Danielle Eaton Hart Oct 15 '15 at 16:44
  • 1
    @Drop I don't think the question is unclear or unanswerable. I want to know how to grab the updated data from the GPU without interrupting (stopping and restarting) the kernel, as Tom suggests in [this answer](http://stackoverflow.com/a/23424352/1572463). – Danielle Eaton Hart Oct 15 '15 at 16:49
  • @DanielleEatonHart Well, [double buffering](https://en.wikipedia.org/wiki/Multiple_buffering) is well known in computer graphics, but it is a somewhat vague notion in non-rendering apps. I think you'd better ask Tom about what he meant. In my opinion, your understanding is correct. What you want is streams. With streams you would not even stop the GPU, because you can memcopy from/to the processed buffer asynchronously. Another thing is that if the GPU and CPU are writing/reading different locations of the same buffer, you can work on the portion you are interested in, so no multiple buffering is required. – Ivan Aksamentov - Drop Oct 15 '15 at 17:03
  • 2
    A persistent kernel, which you have casually mentioned in your question, is a non-trivial thing to write correctly. Are you fully versed on persistent kernel design? Or do you need that explained also? And, as @Drop has said, it might be possible to realize a very efficient double-buffered approach that does not depend on persistent kernels but instead depends on ping-pong kernel launches. – Robert Crovella Oct 15 '15 at 17:06

2 Answers

9

Here's a fully worked example of a "persistent" kernel, producer-consumer approach, with a double-buffered interface from device (producer) to host (consumer).

Persistent kernel design generally implies launching kernels with, at most, the number of blocks that can be simultaneously resident on the hardware (see item 1 on slide 16 here). For the most efficient usage of the machine, we'd generally like to maximize this, while still staying within the aforementioned limit. This involves an occupancy study for a specific kernel, and it will vary from kernel to kernel. Therefore I've chosen to take a shortcut here, and simply launch as many blocks as there are multiprocessors. Such an approach is always guaranteed to work (it could be considered a "lower bound" on the number of blocks to launch for a persistent kernel), but is (typically) not the most efficient usage of the machine. Nevertheless, I claim the occupancy study is beside the point of your question. Furthermore, it is arguable that proper "persistent kernel" design with guaranteed forward progress is actually quite tricky - requiring careful design of the CUDA thread code and placement of threadblocks (e.g. only use 1 threadblock per SM) to guarantee forward progress. However we don't need to delve to this level to address your question (I don't think) and the persistent kernel example I propose here only places 1 threadblock per SM.
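
(As a side note, if you did want to compute that co-resident upper bound rather than take the SM-count shortcut, the occupancy API can estimate it. This is just a sketch and not part of the worked example below; testkernel and nTPB refer to the kernel and block size in that example.)

// Sketch only: estimating the maximum number of co-resident blocks for a
// persistent launch, instead of the one-block-per-SM shortcut used below.
int numSM = 0, blocksPerSM = 0;
cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, 0);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, testkernel, nTPB, 0);
int maxResidentBlocks = numSM * blocksPerSM; // never launch more than this persistently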

I'm also assuming a proper UVA setup, so that I can skip the details of arranging for proper mapped memory allocations in a non-UVA setup.

The basic idea is that we will have 2 buffers on the device, along with 2 "mailboxes" in mapped memory, one for each buffer. The device kernel will fill a buffer with data, then set the "mailbox" to a value (2, in this case) that indicates the host may "consume" the buffer. The device then goes on to the other buffer and repeats the process in a ping-pong fashion between buffers. In order to make this work we must make sure that the device itself has not overrun the buffers (no thread is allowed to be more than one buffer ahead of any other thread) and that before a buffer is populated by the device, the host has consumed the previous contents.

On the host side, it is simply waiting for the mailbox to indicate "full", then copying the buffer from device to host, resetting the mailbox, and performing the "processing" on it (the validate function). It then goes on to the next buffer in a ping-pong fashion. The actual data "production" by the device is just to fill each buffer with the iteration number. The host then checks to see that the proper iteration number was received.

I've structured the code to call out the actual device "work" function (my_compute_function) which is where you would put whatever your Monte Carlo code is. If your code is nicely thread-independent, this should be straightforward. Thus the device side my_compute_function is the producer function, and the host side validate is the consumer function. If your device producer code is not simply thread independent, then you may need to restructure things slightly around the calling point to my_compute_function.

The net effect of this is that the device can "race ahead" and begin filling the next buffer, while the host is "consuming" the data in the previous buffer.

Because persistent kernel design imposes an upper bound on the number of blocks (and threads) in a kernel launch, I've chosen to implement the "work" producer function with a grid-striding loop, so that arbitrary-size buffers can be handled by the given grid width.

Here's a fully worked example:

$ cat t942.cu
#include <stdio.h>

#define ITERS 1000
#define DSIZE 65536
#define nTPB 256

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;

__device__ void my_compute_function(int *buf, int idx, int data){
  buf[idx] = data;  // put your work code here
}

__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready,  const int buffersize, const int iterations){
  // assumption of persistent block-limited kernel launch
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int iter_count = 0;
  while (iter_count < iterations ){ // persistent until iterations complete
    int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
    volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
    volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
    int my_idx = idx;
    while (iter_count - itercnt > 1); // don't overrun buffers on device
    while (*bufrdy == 2);  // wait for buffer to be consumed
    while (my_idx < buffersize){ // perform the "work"
      my_compute_function(buf, my_idx, iter_count);
      my_idx += gridDim.x*blockDim.x; // grid-striding loop
      }
    __syncthreads(); // wait for my block to finish
    __threadfence(); // make sure global buffer writes are "visible"
    if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
    if (!idx){ // am I the master block/thread?
      while (*blkcnt < gridDim.x);  // wait for all blocks to finish
      *blkcnt = 0;
      *bufrdy = 2;  // indicate that buffer is ready
      __threadfence_system(); // push it out to mapped memory
      itercnt++;
      }
    iter_count++;
    }
}

int validate(const int *data, const int dsize, const int val){

  for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
  return 1;
}

int main(){

  int *h_buf1, *d_buf1, *h_buf2, *d_buf2;
  volatile int *m_bufrdy1, *m_bufrdy2;
  // buffer and "mailbox" setup
  cudaHostAlloc(&h_buf1, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&h_buf2, DSIZE*sizeof(int), cudaHostAllocDefault);
  cudaHostAlloc(&m_bufrdy1, sizeof(int), cudaHostAllocMapped);
  cudaHostAlloc(&m_bufrdy2, sizeof(int), cudaHostAllocMapped);
  cudaCheckErrors("cudaHostAlloc fail");
  cudaMalloc(&d_buf1, DSIZE*sizeof(int));
  cudaMalloc(&d_buf2, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMalloc fail");
  cudaStream_t streamk, streamc;
  cudaStreamCreate(&streamk);
  cudaStreamCreate(&streamc);
  cudaCheckErrors("cudaStreamCreate fail");
  *m_bufrdy1 = 0;
  *m_bufrdy2 = 0;
  cudaMemset(d_buf1, 0xFF, DSIZE*sizeof(int));
  cudaMemset(d_buf2, 0xFF, DSIZE*sizeof(int));
  cudaCheckErrors("cudaMemset fail");
  // inefficient crutch for choosing number of blocks
  int nblock = 0;
  cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
  cudaCheckErrors("get multiprocessor count fail");
  testkernel<<<nblock, nTPB, 0, streamk>>>(d_buf1, d_buf2, m_bufrdy1, m_bufrdy2, DSIZE, ITERS);
  cudaCheckErrors("kernel launch fail");
  volatile int *bufrdy;
  int *hbuf, *dbuf;
  for (int i = 0; i < ITERS; i++){
    if (i & 1){  // ping pong on the host side
      bufrdy = m_bufrdy2;
      hbuf = h_buf2;
      dbuf = d_buf2;}
    else {
      bufrdy = m_bufrdy1;
      hbuf = h_buf1;
      dbuf = d_buf1;}
    // int qq = 0; // add for failsafe - otherwise a machine failure can hang
    while ((*bufrdy)!= 2); // use this for a failsafe:  if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
    cudaMemcpyAsync(hbuf, dbuf, DSIZE*sizeof(int), cudaMemcpyDeviceToHost, streamc);
    cudaStreamSynchronize(streamc);
    cudaCheckErrors("cudaMemcpyAsync fail");
    *bufrdy = 0; // release buffer back to device
    if (!validate(hbuf, DSIZE, i)) {printf("validation failure at iter %d\n", i); exit(1);}
    }
 printf("Completed %d iterations successfully\n", ITERS);
}


$ nvcc -o t942 t942.cu
$ ./t942
Completed 1000 iterations successfully
$

I've tested the above code and it seems to work well on Linux. I believe it should be OK on a Windows TCC setup. On Windows WDDM, however, I think there are issues that I am still investigating.

Note that the above kernel design attempts to do a grid-wide synchronization using a block-counting atomic strategy. CUDA (9.0 and newer) now has cooperative groups, and those are the recommended approach for creating a grid-wide sync, rather than the block-counting methodology above.
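
For reference, here is a rough, untested sketch of what the cooperative-groups variant of the kernel could look like. grid.sync() replaces the blkcnt1/blkcnt2 atomics and the spin on *blkcnt (and the itercnt overrun guard, since all threads now advance iterations together). The kernel must then be launched with cudaLaunchCooperativeKernel, typically compiled with -rdc=true for a cc 6.0 or newer device; the rest of main() is unchanged.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch only: same arguments and mailbox protocol as testkernel above,
// but the grid-wide sync is done with cooperative groups.
__global__ void testkernel_cg(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready, const int buffersize, const int iterations){
  cg::grid_group grid = cg::this_grid();
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  for (int iter = 0; iter < iterations; iter++){
    int *buf             = (iter & 1)? buffer2       : buffer1; // ping pong between buffers
    volatile int *bufrdy = (iter & 1)? buffer2_ready : buffer1_ready;
    if (idx == 0) while (*bufrdy == 2); // wait for host to consume this buffer
    grid.sync();                        // nobody writes until the check above has passed
    for (int i = idx; i < buffersize; i += gridDim.x*blockDim.x)
      my_compute_function(buf, i, iter);
    __threadfence();                    // make buffer writes visible
    grid.sync();                        // grid-wide barrier instead of blkcnt atomics
    if (idx == 0){
      *bufrdy = 2;                      // signal the host: buffer is full
      __threadfence_system();           // push the mailbox out to mapped memory
      }
    }
}

// host-side launch (replaces the <<<...>>> launch above):
//   int dsize = DSIZE, iters = ITERS;
//   void *args[] = {&d_buf1, &d_buf2, &m_bufrdy1, &m_bufrdy2, &dsize, &iters};
//   cudaLaunchCooperativeKernel((void *)testkernel_cg, nblock, nTPB, args, 0, streamk);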

Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
0

This isn't a direct answer to your question but it may be of help.

I am working with a CUDA producer-consumer code that appears to be similar in basic structure to yours. I was hoping to speed up the code by making the CPU and GPU run concurrently. I attempted this by restructuring the code this way:

Launch kernel
Copy data
Loop
  Launch kernel
  CPU work
  Copy data
CPU work

This way the CPU can work on the data from the last kernel run while the next set of data is being generated. This cut 30% off the runtime of my code. My guess is it could get better if the GPU/CPU work can be balanced so they take roughly the same amount of time.

I am still launching the same kernel 1000s of times. If the overhead of launching a kernel repeatedly is significant, then looking for a way to do what I have accomplished here with a single launch would be worth it. Otherwise this is probably the best (simplest) solution.
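
In case it helps, here is a minimal sketch of that restructured host loop. The names (mc_kernel, cpu_statistics, d_state, d_out, h_out, NITER) are placeholders, not from my actual code; the point is that the kernel launch is asynchronous, so the CPU statistics for iteration i-1 overlap the kernel for iteration i, and the blocking cudaMemcpy provides the synchronization.

mc_kernel<<<grid, block>>>(d_state, d_out);                // iteration 0
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // waits for iteration 0
for (int i = 1; i < NITER; i++){
  mc_kernel<<<grid, block>>>(d_state, d_out);              // launch iteration i (returns immediately)
  cpu_statistics(h_out);                                   // CPU work on iteration i-1, overlapped with the kernel
  cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // waits for iteration i to finish
}
cpu_statistics(h_out);                                     // CPU work on the last iteration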

goryh
  • 215
  • 1
  • 4
  • 14
  • 2
    This would be a pipelined algorithm. A fully-worked example is [here](http://stackoverflow.com/questions/31186926/multithreading-for-image-processing-at-gpu-using-cuda/31188999#31188999). – Robert Crovella Oct 15 '15 at 18:33