
I just learned the stream technique in CUDA and tried it out. However, I am getting an undesired result: the streams do not run in parallel. (GPU: Tesla M6, OS: Red Hat Enterprise Linux 8.)

I have a data matrix of size (5, 2048) and a kernel that processes it.

My plan is to decompose the data into nStreams = 4 sectors and use 4 streams to parallelize the kernel execution.

Part of my code looks like the following:

int rows = 5;
int cols = 2048;

int blockSize = 32;
int gridSize = (rows*cols) / blockSize;   // 320 blocks for the whole matrix
dim3 block(blockSize);
dim3 grid(gridSize);

int nStreams = 4;    // preparation for streams
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for(int ii=0;ii<nStreams;ii++){
    checkCudaErrors(cudaStreamCreate(&streams[ii]));
}

int streamSize = rows * cols / nStreams;   // elements per stream
dim3 streamGrid = streamSize/blockSize;    // 80 blocks per stream

for(int jj=0;jj<nStreams;jj++){
    int offset = jj * streamSize;
    Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
}    // d_Data is the matrix on the GPU

The Visual Profiler result shows that the 4 streams are not parallel. Stream 13 is the first to work and stream 16 is the last; there is a 12.378us gap between stream 13 and stream 14, and each kernel execution lasts around 5us. The 'Runtime API' row above shows 'cudaLaunch'.

Could you give me some advice? Thanks!

(I don't know how to upload pictures on Stack Overflow, so I have just described the result in words.)

DerekLu

2 Answers


First of all, there is no guarantee that work launched in separate streams will actually be executed on the GPU in parallel. As pointed out in the programming guide, using multiple streams merely opens up the possibility; you cannot rely on it actually happening. It's up to the driver to decide.
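
As a first sanity check, you can at least confirm that the device supports concurrent kernel execution at all via the concurrentKernels device property. A minimal sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // query device 0
    std::printf("concurrentKernels = %d\n", prop.concurrentKernels); // 1 if supported
    return 0;
}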

Apart from that, your Tesla M6 has 12 multiprocessors if I'm not mistaken. Each of these 12 Maxwell multiprocessors can hold a maximum of 32 resident blocks, which brings the total maximum number of blocks resident on the entire device to 384. You're launching 320 blocks of 32 threads each. That alone doesn't leave all that much room, and you're probably using more than 32 registers per thread, so the GPU will be quite full with a single one of these launches, which is most likely why the driver chooses not to run another kernel in parallel.
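
You can query these limits at runtime instead of estimating them. A minimal sketch, assuming Mykernel and the 32-thread blocks from the question are in scope:

int blocksPerSM = 0;
// accounts for the registers and shared memory Mykernel actually uses
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, Mykernel, 32, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

// e.g. 32 blocks/SM * 12 SMs = 384 resident blocks at most on an M6,
// against the 320 blocks launched in the question
printf("max resident blocks: %d\n", blocksPerSM * prop.multiProcessorCount);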

Parallel kernel launches mainly make sense when you have, e.g., a bunch of small kernels that do different work and could run next to each other on separate multiprocessors. It seems that your workload could easily fill the entire device. What exactly are you hoping to achieve by running multiple kernels in parallel? Why are you working with such tiny blocks? Would it not make more sense to launch the whole thing as one big kernel with larger blocks? Normally, you'd want at least a couple of warps per block; see, e.g., this question for more: How do I choose grid and block dimensions for CUDA kernels?

If you're using shared memory, you'll also want at least two blocks per multiprocessor, as otherwise you won't even be able to use all of it on some GPUs (which, e.g., offer 96 KiB of shared memory per multiprocessor while each block can only use a maximum of 48 KiB of it).
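
For illustration, a minimal sketch of the single-launch alternative, assuming the Mykernel signature and d_Data from the question:

int rows = 5, cols = 2048;
int blockSize = 128;                                       // 4 warps per block
int gridSize = (rows * cols + blockSize - 1) / blockSize;  // round up: 80 blocks
Mykernel<<<gridSize, blockSize>>>(d_Data, rows * cols);    // one launch, default stream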

Michael Kenzel

To add to the existing answer (which is completely correct), consider the following trivially complete version of the code you have posted in your question:

#include <cstdlib>
#include <cuda_runtime.h>

__global__
void Mykernel(float* data, int size)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // grid-stride loop: each thread processes multiple elements
    for(; tid < size; tid += blockDim.x * gridDim.x) data[tid] = 54321.f;
}

int main()
{
    int rows = 2048;   // much larger problem than in the question
    int cols = 2048;

    int blockSize = 32;
    dim3 block(blockSize);

    int nStreams = 4;    // preparation for streams
    cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
    for(int ii=0;ii<nStreams;ii++){
        cudaStreamCreate(&streams[ii]);
    }

    float* d_Data;
    cudaMalloc(&d_Data, sizeof(float) * rows * cols);
    int streamSize = rows * cols / nStreams;
    dim3 streamGrid = dim3(4);   // only 4 blocks per launch

    for(int jj=0;jj<nStreams;jj++){
        int offset = jj * streamSize;
        Mykernel<<<streamGrid,block,0,streams[jj]>>>(&d_Data[offset],streamSize);
    }    // d_Data is the matrix on the GPU

    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

Note two differences: the number of blocks launched per kernel is reduced to 4, and the total amount of computation per thread is increased by setting rows to 2048. The kernel itself contains a grid-stride loop, which allows each thread to process multiple inputs, ensuring that the whole input dataset is processed no matter how many total blocks/threads are launched.

Profiling on a similar Maxwell GPU to your device shows this:

[Profiler timeline: the four kernel launches overlap across the streams]

i.e. the kernels do overlap. Now let's reduce the problem size back to the size specified in your question (rows = 5):

[Profiler timeline: the four kernel launches run back-to-back, with no overlap]

The kernels no longer overlap. Why? Because the driver and device latency is high enough, and the execution time of each kernel short enough, that there is no time for execution overlap to occur, even when device resources would otherwise allow it. So beyond the resource requirement limitations described in the other answer, the volume of computation must be large enough to offset the fixed latency associated with scheduling a kernel launch into a stream.

Finally, I would suggest that the correct approach to setting up a stream-based concurrent execution scheme looks something like this:

int blockSize = 32;
dim3 block(blockSize);
int blocksperSM = 0, SMperGPU = 13; // GPU specific (13 SMs on the profiled device)
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksperSM, Mykernel, blockSize, 0); // kernel specific
dim3 streamGrid = blocksperSM * (SMperGPU / nStreams); // assumes SMperGPU >> nStreams

Here, the idea is that the number of available SMs is (roughly) equally divided amongst the streams, and the number of blocks which maximally occupies each SM for the selected block size is obtained for the kernel via the occupancy API.

This profiles as follows:

[Profiler timeline: overlapping kernels with short execution times]

which yields both overlap and short execution times, by correctly matching the resource requirements of the kernel to the capacity of the GPU for the rows = 2048 case.
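
For completeness, a minimal sketch of how the occupancy-derived grid slots into the launch loop, assuming the streams, d_Data, rows and cols set up as in the listing above:

int nStreams = 4;
int blockSize = 32;
dim3 block(blockSize);

int blocksperSM = 0, SMperGPU = 13;   // GPU specific
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksperSM, Mykernel, blockSize, 0);
dim3 streamGrid = blocksperSM * (SMperGPU / nStreams);

int streamSize = rows * cols / nStreams;
for(int jj = 0; jj < nStreams; jj++){
    int offset = jj * streamSize;
    Mykernel<<<streamGrid, block, 0, streams[jj]>>>(&d_Data[offset], streamSize);
}
cudaDeviceSynchronize();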

talonmies