
I would like to create a basic CUDA application that demonstrates memory transfer/kernel execution overlap for students, but in nvvp it appears there is no concurrent execution. Can you help me figure out what is wrong?

The full source (Visual Studio 2015, CUDA 8.0, compiled for sm_35/compute_35, Titan X card):

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <malloc.h>
#include <stdio.h>

#define MEMSIZE 8000000
#define STREAM_N 8

__global__ void TestKernel(char *img)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < 100; k++)
        img[pos] = img[pos] / 2 + 128;
}

int main()
{
    // allocate memory and streams
    char *img[STREAM_N];
    char *d_img[STREAM_N];
    cudaStream_t streams[STREAM_N];

    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaMalloc((void**)&d_img[pi], MEMSIZE / STREAM_N);
        cudaMallocHost((void**)&img[pi], MEMSIZE / STREAM_N);
        cudaStreamCreate(&streams[pi]);
    }

    // process packages one way
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
    for (int pi = 0; pi < STREAM_N; pi++)
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);

    // process packages another way
    for (int pi = 0; pi < STREAM_N; pi++) 
    {
        cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
        TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
        cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
    }
    cudaDeviceSynchronize();

    // destroy streams and free memory
    for (int pi = 0; pi < STREAM_N; pi++)
    {
        cudaStreamDestroy(streams[pi]);
        cudaFreeHost(img[pi]);
        cudaFree(d_img[pi]);
    }
}

And the visual profiler output:

(nvvp timeline screenshot: no overlap between memory copies and kernels)

  • http://stackoverflow.com/a/35010019/681865 – talonmies Sep 05 '16 at 14:27
  • Thanks, I have read this question, but it does not help me. One of the presented solutions is very similar to my code (but using multiple GPUs). But what do you think prevents concurrent execution in my case? – Sanyo Sep 05 '16 at 14:41
  • 1
    What kind of concurrency are you expecting or wanting to see specifically? There are 4 types of typical execution "types" within a CUDA architecture that may overlap: 1. CPU activity 2. GPU activity 3. D->H transfers 4. H->D transfers. You might be specific in which overlaps you are looking for: 2+2? 2+3? 2+4? others? There are separate rules and requirements in each case. You certainly will not see 2+2 overlap with your code since you are launching 2500 blocks per kernel launch, which effectively prevents concurrent execution of kernels with other kernels. – Robert Crovella Sep 05 '16 at 16:55
  • You may also wish to study the CUDA concurrent kernels sample code, as it demonstrates a number of necessary and useful concepts. – Robert Crovella Sep 05 '16 at 16:56
  • I have 4 independent streams with the same H2D -> kernel -> D2H operations, so I expect something like this: stream 1 H2D copy | stream 1 kernel launch, concurrently stream 2 H2D copy starts | stream 2 kernel launch, concurrently stream 3 H2D copy starts ... and similarly, when the kernels stop, the D2H copies can start immediately, while other kernels are still running. I have read the corresponding chapters of CUDA books about overlapping, and based on that information the code above should work... It is easy to draw some figures about overlapping, but I need a working exe. – Sanyo Sep 05 '16 at 19:18
  • 1
    It's possible that WDDM command batching is getting in your way. Since you have a Titan card, you could try putting it in TCC mode instead of WDDM mode. Alternatively, run your code on linux. When I run your code as-is on linux, I witness overlap of copy operations with compute operations. [Here](http://imgur.com/a/CMceI) is what I see in `nvvp`. Also note that Titan X is not an sm_35 card. – Robert Crovella Sep 05 '16 at 20:16
  • Thanks! I have switched to TCC mode and it works. – Sanyo Sep 06 '16 at 11:35
  • @Sanyo: If that is the case, could you please write a short answer explaining what you did to fix the problem. It is OK to answer your own questions on [SO] – talonmies Sep 06 '16 at 13:05
  • If you don't want to switch to the TCC driver, a call to cudaStreamQuery() can also force immediate kernel launch without further batching. – tera Sep 06 '16 at 13:58
  • Ah - should have looked at your answer first, you've already found out about that. – tera Sep 06 '16 at 13:58

1 Answer


WDDM command batching caused the problem. The best solution is to switch the operating mode of the card from WDDM to TCC, which can be done via the nvidia-smi command:

nvidia-smi -i <gpu_id> -dm 1
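As a sanity check (my addition, not from the original thread; this assumes the GPU has index 0 and a Windows build of nvidia-smi, which exposes the driver-model query fields), you can confirm the mode before and after the switch:

```shell
# Show the current and pending driver model for GPU 0.
# Expect "TCC, TCC" once the switch has taken effect; a reboot is
# typically required after changing the mode with -dm.
nvidia-smi -i 0 --query-gpu=driver_model.current,driver_model.pending --format=csv,noheader
```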

This solved my problem; the nvvp timeline now shows the overlap pattern I wanted to see. (timeline screenshot)

An alternative solution is to manually flush the command queue with cudaStreamQuery (source). Its return value is ignored here; the call is used purely for its side effect of submitting the batched commands to the GPU:

for (int pi = 0; pi < STREAM_N; pi++)
{
    cudaMemcpyAsync(d_img[pi], img[pi], MEMSIZE / STREAM_N, cudaMemcpyHostToDevice, streams[pi]);
    TestKernel <<< MEMSIZE / STREAM_N / 400, 400, 0, streams[pi] >>>(d_img[pi]);
    cudaMemcpyAsync(img[pi], d_img[pi], MEMSIZE / STREAM_N, cudaMemcpyDeviceToHost, streams[pi]);
    cudaStreamQuery(streams[pi]); // flush the WDDM command queue
}
Sanyo