I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding: while data is exchanged, the GPU computes the part of the mesh that is not involved in the communication, and computes the rest of the mesh once the communication is done.
I am trying to solve this by having one stream (computeStream) for the long-running kernel (fluxKernel) and one (communicationStream) for the different phases of the communication. The computeStream has a very low priority, in order to allow kernels on the communicationStream to interleave with the fluxKernel, even though it uses all resources.
These are the streams I am using:
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange( &priority_low, &priority_high );

cudaStreamCreateWithPriority( &communicationStream, cudaStreamNonBlocking, priority_high );
cudaStreamCreateWithPriority( &computeStream      , cudaStreamNonBlocking, priority_low  );
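As a small sanity check (purely illustrative, not part of the solver), the queried range can be printed; as far as I know, a device without stream priority support returns 0 for both values, in which case the two streams would behave identically:

int priority_low_check, priority_high_check;
cudaDeviceGetStreamPriorityRange( &priority_low_check, &priority_high_check );
std::cout << "stream priorities: low = "  << priority_low_check
          << ", high = "                  << priority_high_check << std::endl;
// both values being 0 would indicate that stream priorities are not supported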
The desired concurrency pattern looks like this:
I need synchronization of the communicationStream before I send the data via MPI, to ensure that the data has been completely downloaded before I send it on.
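To make that synchronization point explicit, here is a small sketch of it in isolation (hostBuffer, deviceBuffer and numBytes are just placeholder names). As far as I understand, waiting on an event recorded right after the download should be equivalent to synchronizing the whole stream:

cudaEvent_t downloadDone;
cudaEventCreateWithFlags( &downloadDone, cudaEventDisableTiming );

sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( hostBuffer, deviceBuffer, numBytes, cudaMemcpyDeviceToHost, communicationStream );
cudaEventRecord ( downloadDone, communicationStream );
cudaEventSynchronize( downloadDone ); // host waits until the download has finished
MPI_Isend( ... );                     // only now is the host buffer safe to hand to MPI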
In the following listing I show the structure of what I am currently doing. First I start the long-running fluxKernel for the main part of the mesh on the computeStream. Then I start a sendKernel that collects the data to be sent to the second GPU and subsequently download it to the host (I cannot use CUDA-aware MPI due to hardware limitations). The data is then sent non-blocking via MPI_Isend, and a blocking receive (MPI_Recv) is used subsequently. When the data is received, the procedure is done backwards: first the data is uploaded to the device and then spread into the main data structure by recvKernel. Finally the fluxKernel is called for the remaining part of the mesh on the communicationStream.
Note that before and after the shown code, kernels are run on the default stream.
{ ... } // Preparations
// Start main part of computation on first stream
fluxKernel<<< ..., ..., 0, computeStream >>>( /* main Part */ );
// Prepare send data
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
// MPI Communication
MPI_Isend( ... );
MPI_Recv ( ... );
// Use received data
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
fluxKernel<<< ..., ..., 0, communicationStream >>>( /* remaining Part */ );
{ ... } // Rest of the Computations
I used nvprof and the Visual Profiler to see whether the streams actually execute concurrently. This is the result:
I observe that the sendKernel (purple), upload, MPI communication and download run concurrently to the fluxKernel. The recvKernel (red), however, only starts after the other stream is finished. Turning off the synchronization does not solve the problem:
My real application has not just one communication, but several. I tested this with two communications as well. The procedure is:
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );
sendKernel<<< ..., ..., 0, communicationStream >>>( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyDeviceToHost, communicationStream );
cudaStreamSynchronize( communicationStream );
MPI_Isend( ... );
MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
MPI_Recv ( ... );
cudaMemcpyAsync ( ..., ..., ..., cudaMemcpyHostToDevice, communicationStream );
recvKernel<<< ..., ..., 0, communicationStream >>>( ... );
The result is similar to the one with one communication (above), in the sense that the second kernel invocation (this time it is a sendKernel) is delayed until the kernel on the computeStream is finished.
Hence the overall observation is that the second kernel invocation is delayed, independent of which kernel it is.
Can you explain why the GPU is synchronizing in this way, or how I can get the second kernel on the communicationStream to also run concurrently to the computeStream?
Thank you very much.
Edit 1: complete rework of the question
Minimal Reproducible Example
I built a minimal reproducible example. In the end the code prints the int data to the terminal. The correct last value would be 32778 (= (32*1024 - 1) + 1 + 10). At the beginning I added an option integer to test 3 different options:
- 0: Intended version with synchronization before the CPU modification of the data
- 1: Same as 0, but without synchronization
- 2: Dedicated stream for the memcpys and no synchronization
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

const int option = 0;

const int numberOfEntities      = 2 * 1024 * 1024;
const int smallNumberOfEntities = 32 * 1024;

__global__ void longKernel( float* dataDeviceIn, float* dataDeviceOut, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if( index >= numberOfEntities ) return;

    float tmp = dataDeviceIn[index];
#pragma unroll
    for( int i = 0; i < 2000; i++ ) tmp += 1.0;
    dataDeviceOut[index] = tmp;
}

__global__ void smallKernel_1( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if( index >= numberOfEntities ) return;

    smallDeviceData[index] = index;
}

__global__ void smallKernel_2( int* smallDeviceData, int numberOfEntities )
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if( index >= numberOfEntities ) return;

    int value = smallDeviceData[index];
    value += 10;
    smallDeviceData[index] = value;
}

int main( int argc, char **argv )
{
    cudaSetDevice( 0 );

    float* dataDeviceIn;
    float* dataDeviceOut;

    cudaMalloc( &dataDeviceIn , sizeof(float) * numberOfEntities );
    cudaMalloc( &dataDeviceOut, sizeof(float) * numberOfEntities );

    int* smallDataDevice;
    int* smallDataHost;

    cudaMalloc    ( &smallDataDevice, sizeof(int) * smallNumberOfEntities );
    cudaMallocHost( &smallDataHost  , sizeof(int) * smallNumberOfEntities );

    cudaStream_t streamLong;
    cudaStream_t streamSmall;
    cudaStream_t streamCopy;

    int priority_high, priority_low;
    cudaDeviceGetStreamPriorityRange( &priority_low, &priority_high );

    cudaStreamCreateWithPriority( &streamLong , cudaStreamNonBlocking, priority_low  );
    cudaStreamCreateWithPriority( &streamSmall, cudaStreamNonBlocking, priority_high );
    cudaStreamCreateWithPriority( &streamCopy , cudaStreamNonBlocking, priority_high );

    //////////////////////////////////////////////////////////////////////////

    longKernel <<< numberOfEntities / 32, 32, 0, streamLong >>> ( dataDeviceIn, dataDeviceOut, numberOfEntities );

    //////////////////////////////////////////////////////////////////////////

    smallKernel_1 <<< smallNumberOfEntities / 32, 32, 0, streamSmall >>> ( smallDataDevice, smallNumberOfEntities );

    if( option <= 1 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost, streamCopy  );

    if( option == 0 ) cudaStreamSynchronize( streamSmall );

    // some CPU modification of data
    for( int i = 0; i < smallNumberOfEntities; i++ ) smallDataHost[i] += 1;

    if( option <= 1 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamSmall );
    if( option == 2 ) cudaMemcpyAsync( smallDataDevice, smallDataHost, sizeof(int) * smallNumberOfEntities, cudaMemcpyHostToDevice, streamCopy  );

    smallKernel_2 <<< smallNumberOfEntities / 32, 32, 0, streamSmall >>> ( smallDataDevice, smallNumberOfEntities );

    //////////////////////////////////////////////////////////////////////////

    cudaDeviceSynchronize();

    cudaMemcpy( smallDataHost, smallDataDevice, sizeof(int) * smallNumberOfEntities, cudaMemcpyDeviceToHost );

    for( int i = 0; i < smallNumberOfEntities; i++ ) std::cout << smallDataHost[i] << "\n";

    return 0;
}
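As a side note, instead of reading the terminal dump, the result can also be checked programmatically: every entry should end up as its index + 1 (CPU) + 10 (smallKernel_2), i.e. index + 11. A small sketch that could replace the std::cout loop above:

bool correct = true;
for( int i = 0; i < smallNumberOfEntities; i++ )
    if( smallDataHost[i] != i + 11 ) { correct = false; break; }
std::cout << ( correct ? "result OK" : "result WRONG" ) << std::endl;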
Running this code I see the same behavior as described above:
Option 1 (wrong result, the +1 from the CPU is missing):
Option 2 (completely wrong result, all 10, download before smallKernel_1):
Solutions:
Running Option 0 under Linux (following the suggestion in Robert's answer) brings the expected behavior!