I just learned the stream technique in CUDA and tried it out. However, I get an undesired result: the streams do not run concurrently. (GPU: Tesla M6, OS: Red Hat Enterprise Linux 8.)
I have a data matrix of size (5, 2048) and a kernel that processes it.
My plan is to decompose the data into `nStreams = 4` sectors and use 4 streams to run the kernel executions concurrently.
Part of my code looks like this:
```cpp
int rows = 5;
int cols = 2048;
int blockSize = 32;
int gridSize = (rows * cols) / blockSize;
dim3 block(blockSize);
dim3 grid(gridSize);

// Preparation for streams
int nStreams = 4;
cudaStream_t *streams = (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
for (int ii = 0; ii < nStreams; ii++) {
    checkCudaErrors(cudaStreamCreate(&streams[ii]));
}

// Each stream processes one sector of the matrix
int streamSize = rows * cols / nStreams;
dim3 streamGrid(streamSize / blockSize);
for (int jj = 0; jj < nStreams; jj++) {
    int offset = jj * streamSize;
    // d_Data is the matrix on the GPU
    Mykernel<<<streamGrid, block, 0, streams[jj]>>>(&d_Data[offset], streamSize);
}
```
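For completeness, here is a minimal self-contained version of my setup. The real `Mykernel` is more complex; the element-scaling kernel below is a simplified stand-in, and `d_Data` is just zero-filled for illustration:

```cuda
#include <cstdio>
#include <cstdlib>

// Simple error-checking helper (stand-in for the CUDA samples' macro)
#define checkCudaErrors(call)                                          \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Simplified stand-in for the real kernel: scale each element in place.
__global__ void Mykernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] *= 2.0f;
}

int main() {
    int rows = 5, cols = 2048;
    int blockSize = 32;
    int nStreams = 4;

    // Device matrix, zero-filled as dummy data
    float *d_Data;
    checkCudaErrors(cudaMalloc(&d_Data, rows * cols * sizeof(float)));
    checkCudaErrors(cudaMemset(d_Data, 0, rows * cols * sizeof(float)));

    cudaStream_t *streams =
        (cudaStream_t *)malloc(nStreams * sizeof(cudaStream_t));
    for (int ii = 0; ii < nStreams; ii++)
        checkCudaErrors(cudaStreamCreate(&streams[ii]));

    // Launch one sector per stream
    int streamSize = rows * cols / nStreams;
    dim3 block(blockSize);
    dim3 streamGrid(streamSize / blockSize);
    for (int jj = 0; jj < nStreams; jj++) {
        int offset = jj * streamSize;
        Mykernel<<<streamGrid, block, 0, streams[jj]>>>(&d_Data[offset],
                                                        streamSize);
    }
    checkCudaErrors(cudaDeviceSynchronize());

    for (int ii = 0; ii < nStreams; ii++)
        checkCudaErrors(cudaStreamDestroy(streams[ii]));
    free(streams);
    checkCudaErrors(cudaFree(d_Data));
    return 0;
}
```

This is the program I profiled; compiling with `nvcc` and running it under the Visual Profiler reproduces the timeline described below.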
The Visual Profiler shows that the 4 streams do not run in parallel. Stream 13 is the first to start working and stream 16 is the last; there is a gap of 12.378 us between stream 13 and stream 14, while each kernel execution lasts only around 5 us. In the 'Runtime API' row above the streams, the profiler shows 'cudaLaunch' calls.
Could you give me some advice? Thanks!
(I don't know how to upload pictures on Stack Overflow, so I have described the result in words.)