I am trying to GPU-accelerate an algorithm where I receive an asynchronous stream of particles in 3D space, $p = [x, y, t]$. Each vector $p_n$ needs to be multiplied by a set of transformation matrices. Since these transformations are independent of each other they can happen in parallel, so I have written a CUDA kernel to do that. It works well, but of course for each incoming $p_n$ I end up launching the CUDA kernel anew. Launching a CUDA kernel carries a major time penalty, and thus I lose the advantage of GPU acceleration. So my question is: can I keep the kernel open and stream the particles to it somehow?
In case it's any help here is my current kernel:
__global__
void project(float *projection_matrix, float *vector, float *output_matrix) {
    int col_index = blockIdx.x * blockDim.x + threadIdx.x;
    // each thread handles one row of one 3x3 transform (threadIdx.y in 0..2)
    int output_index = col_index * 3 + threadIdx.y;
    int transform_first_element = col_index * 9 + threadIdx.y * 3;
    int stride = blockDim.x * blockDim.y * gridDim.x;
    // grid-stride loop over all transform rows
    while (output_index < (NUMBER_OF_TRANSFORMS * 3)) {
        output_matrix[output_index] =
              projection_matrix[transform_first_element]     * vector[0]
            + projection_matrix[transform_first_element + 1] * vector[1]
            + projection_matrix[transform_first_element + 2] * vector[2];
        output_index += stride;
        transform_first_element += stride * 3; // advance matrix offset in step with the output row
    }
}
and this is where I call it:
...
project <<<num_blocks_dim, block_dim >>> (transformationList, inputVector, outputMatrix);
cudaDeviceSynchronize();
...
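For what it's worth, the direction I have been considering is to buffer incoming particles on the host and launch once per batch instead of once per particle, issuing the copies and the launch on a CUDA stream so transfers and compute for successive batches can overlap. A rough, untested sketch (the batched kernel, `batch_count`, and the `d_*`/`h_*` buffer names are my own assumptions, not code I have running):

```cuda
// Hypothetical batched variant: one launch processes batch_count input vectors.
__global__
void project_batch(const float *projection_matrix, const float *vectors,
                   float *output, int num_vectors) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = NUMBER_OF_TRANSFORMS * 3 * num_vectors;
    // grid-stride loop: i enumerates every (vector, transform-row) pair
    for (int i = idx; i < total; i += blockDim.x * gridDim.x) {
        int v   = i / (NUMBER_OF_TRANSFORMS * 3);  // which input vector
        int row = i % (NUMBER_OF_TRANSFORMS * 3);  // which transform row
        const float *m = projection_matrix + row * 3;
        const float *p = vectors + v * 3;
        output[i] = m[0] * p[0] + m[1] * p[1] + m[2] * p[2];
    }
}

// Host side: enqueue copies and the launch on one stream, so the next
// batch can be staged while the current one computes.
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_vectors, h_vectors, batch_count * 3 * sizeof(float),
                cudaMemcpyHostToDevice, stream);
project_batch<<<num_blocks, threads_per_block, 0, stream>>>(
    d_transforms, d_vectors, d_output, batch_count);
cudaMemcpyAsync(h_output, d_output,
                batch_count * NUMBER_OF_TRANSFORMS * 3 * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
```

This amortizes the launch cost over the whole batch, but I am not sure it is the right pattern for a truly asynchronous particle stream, which is why I am asking.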