
I am trying to GPU-accelerate an algorithm in which I receive an asynchronous stream of particles in 3D space, $p=[x,y,t]$. Each vector $p_n$ needs to be multiplied by a bunch of transformation matrices. Since these transformations are independent of one another, they can happen in parallel, so I have written a CUDA kernel to do that. It works well, but of course for each incoming $p_n$ I end up launching the CUDA kernel anew. Launching a CUDA kernel carries a major time penalty, and so I lose the advantage of GPU acceleration. My question is: can I keep the kernel open and stream the particles to it somehow?

In case it's any help, here is my current kernel:

__global__
void project(float *projection_matrix, float *vector, float *output_matrix) {
    // Each thread computes one row of one 3x3 transform applied to the vector.
    int col_index = blockIdx.x * blockDim.x + threadIdx.x;  // transform index
    int output_index = col_index * 3 + threadIdx.y;         // flat (transform, row) index
    int transform_first_element = col_index * 9 + threadIdx.y * 3;
    int stride = blockDim.x * blockDim.y * gridDim.x;

    // Grid-stride loop over all (transform, row) pairs; the matrix offset
    // advances together with the output index (it is always output_index * 3).
    while (output_index < (NUMBER_OF_TRANSFORMS * 3)) {
        output_matrix[output_index] =
              projection_matrix[transform_first_element]     * vector[0]
            + projection_matrix[transform_first_element + 1] * vector[1]
            + projection_matrix[transform_first_element + 2] * vector[2];
        output_index += stride;
        transform_first_element += stride * 3;
    }
}

and this is where I call it:

...
project <<<num_blocks_dim, block_dim >>> (transformationList, inputVector, outputMatrix);
cudaDeviceSynchronize();
...
Mr Squid
  • Yes, you can keep the kernel "open" and stream particles to it. It is referred to as a "persistent kernels" design paradigm, and there are papers written about it as well as questions here on SO about it. – Robert Crovella Oct 26 '17 at 08:58
  • @RobertCrovella - Thanks for your reply, it gave me a lot of good keywords to follow up on that I would otherwise not have known about. I came across this nicely laid out example - written by you in fact! (https://stackoverflow.com/questions/33150040/doubling-buffering-in-cuda-so-the-cpu-can-operate-on-data-produced-by-a-persiste/33158954#33158954). Is this kind of producer/consumer design something you would recommend for my use case then? Would one do it any differently now than "back" in 2015? – Mr Squid Oct 27 '17 at 02:16
  • I don't see enough description of your use case to make a recommendation. I was mainly responding to the question "can I keep the kernel open and stream the particles to it somehow?". The change I would make today vs 2015 would be to use cooperative groups - a new feature in CUDA 9 - to manage the persistent kernel design. – Robert Crovella Oct 27 '17 at 13:20
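
For reference, here is a minimal single-block sketch of the persistent-kernel pattern mentioned in the comments above. The flag values, the persistent_project name, and the use of zero-copy mapped host memory (cudaHostAlloc with cudaHostAllocMapped) are illustrative assumptions, not code from the question; a multi-block version would additionally need a grid-wide synchronization, e.g. the cooperative groups feature mentioned above.

#define READY 1   // host has posted a new particle
#define DONE  2   // device has finished the current particle
#define QUIT  3   // host asks the kernel to exit

__global__
void persistent_project(float *projection_matrix,
                        volatile float *vector,    // 3 floats in mapped host memory
                        float *output_matrix,
                        volatile int *flag) {      // handshake flag in mapped host memory
    while (true) {
        // Spin until the host posts new work or asks us to quit.
        while (*flag != READY && *flag != QUIT) { }
        if (*flag == QUIT) return;

        // Same math as the question's kernel, block-stride over all rows.
        for (int i = threadIdx.x; i < NUMBER_OF_TRANSFORMS * 3; i += blockDim.x) {
            int t = i * 3;
            output_matrix[i] = projection_matrix[t]     * vector[0]
                             + projection_matrix[t + 1] * vector[1]
                             + projection_matrix[t + 2] * vector[2];
        }
        __syncthreads();          // every row of this particle is written
        __threadfence_system();   // make the results visible to the host
        if (threadIdx.x == 0) *flag = DONE;
        __syncthreads();          // keep all threads in the same iteration
    }
}

On the host side you would launch it once, e.g. persistent_project<<<1, 256>>>(...), then for each incoming particle write the three floats into the mapped vector buffer, set *flag = READY, and poll until the kernel reports DONE.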

1 Answer


You'll need to batch the requests up into a larger block and invoke a kernel on many particles at once. You can likely use the third grid dimension to iterate over them. One way to do this is to accumulate incoming particles while the kernel is running on the previous batch. If you do not accumulate enough particles to justify a kernel launch, process them on the CPU instead.
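
As a concrete illustration, here is a sketch of a batched variant of the question's kernel. The project_batch name, the num_particles parameter, and the flat layouts (particle p's vector at vectors + p * 3, its results at output_matrix + p * NUMBER_OF_TRANSFORMS * 3) are assumptions for illustration:

__global__
void project_batch(const float *projection_matrix, const float *vectors,
                   float *output_matrix, int num_particles) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    int work_items = num_particles * NUMBER_OF_TRANSFORMS * 3;

    // Grid-stride loop over every (particle, transform, row) combination,
    // so one launch covers the whole accumulated batch.
    for (int i = tid; i < work_items; i += stride) {
        int particle = i / (NUMBER_OF_TRANSFORMS * 3);
        int row      = i % (NUMBER_OF_TRANSFORMS * 3);  // transform*3 + row
        const float *v = vectors + particle * 3;
        int t = row * 3;   // first element of this transform's row
        output_matrix[i] = projection_matrix[t]     * v[0]
                         + projection_matrix[t + 1] * v[1]
                         + projection_matrix[t + 2] * v[2];
    }
}

The host would accumulate particles into a pinned staging buffer while the previous batch's kernel runs, copy the batch over, and launch once per batch instead of once per particle.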

If the particles are being produced on the GPU, newer versions of CUDA give you the option of launching a kernel from within a kernel (dynamic parallelism), but you still need a pretty large batch to make that a win.
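
A sketch of that option, assuming the GPU-produced particles sit in a device buffer vectors with three floats per particle; consume_particles and the 32x3 launch configuration are illustrative. Dynamic parallelism requires compute capability 3.5+ and compiling with nvcc -rdc=true.

// Parent kernel: one thread launches the question's project kernel for each
// GPU-produced particle, avoiding a round trip to the host per particle.
__global__
void consume_particles(float *projection_matrix, float *vectors,
                       float *outputs, int num_particles) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        dim3 block(32, 3);                            // assumed configuration
        int grid = (NUMBER_OF_TRANSFORMS + 31) / 32;  // cover all transforms
        for (int p = 0; p < num_particles; ++p) {
            project<<<grid, block>>>(projection_matrix,
                                     vectors + p * 3,
                                     outputs + p * NUMBER_OF_TRANSFORMS * 3);
        }
    }
}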

If these are coming from the CPU and then going back to the CPU, I'd be surprised if you can make it pay off at all unless the number of matrices is pretty large (comparing against well-optimized SIMD CPU code).

Zalman Stern
  • @ZalmanStern I hadn't considered SIMD as an alternative. I certainly wouldn't like to start learning SSE, since it seems very complicated to me, but do you think I might be able to achieve what I want with a higher-level library such as Intel MKL? As a rule of thumb, in which use cases should I consider CPU SIMD instead of GPU acceleration? – Mr Squid Oct 27 '17 at 03:17
  • The main issue is the cost of the dataflow. There is a fixed overhead in transferring a small chunk of data to or from the GPU, and it takes a lot of computation to make that worthwhile. Check out Halide, halide-lang.org. It may allow writing the code in a way that targets both CPU and GPU efficiently. (Depends a bit on specifics, of course. As a disclaimer, I work on Halide.) – Zalman Stern Oct 27 '17 at 04:04