I have a grid of rectangles, each of which consists of a rectangular grid of points. All points in the interior of a rectangle can be processed with exactly the same instruction sequence in a kernel, so I can launch a kernel over tens of thousands of points, with each thread handling about 10-50 points. The points on the edges and corners of the rectangles, however, require a large number of different instruction sequences.
From a design point of view, it would be easier to launch a separate kernel for each set of points that shares the same instruction sequence. This would mean that some kernel launches would handle only very few points, probably fewer than 10.
So I would have perhaps 4 kernel launches with tens of thousands of points to handle (10-50 points per thread), and perhaps 30-100 kernel launches with only a few points each (normally 1 point per thread).
I have absolutely no idea whether this is acceptable or whether it will completely destroy my performance. I would be glad if you could give me a rough estimate, or at least some hints on what to consider in order to arrive at one.
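
One thing that could be measured directly is the fixed cost of the many small launches. The following is a minimal CUDA sketch of such a measurement, not my actual code: `edge_case_kernel`, the 8 points per launch, and the 100 launches are hypothetical placeholders chosen to match the numbers above. It times a batch of tiny launches with CUDA events so that the total can be compared with the runtime of the large kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one edge/corner kernel: one thread per point.
__global__ void edge_case_kernel(float* points, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        points[i] += 1.0f;  // placeholder for the real per-point work
    }
}

int main() {
    const int n = 8;           // assumption: "fewer than 10" points per small launch
    const int launches = 100;  // assumption: upper end of the 30-100 small launches

    float* d_points = nullptr;
    if (cudaMalloc(&d_points, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_points, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so that one-time initialization is not included in the timing.
    edge_case_kernel<<<1, 32>>>(d_points, n);
    cudaDeviceSynchronize();

    // Time a batch of tiny launches issued back to back on the default stream.
    cudaEventRecord(start);
    for (int k = 0; k < launches; ++k) {
        edge_case_kernel<<<1, 32>>>(d_points, n);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d small launches: %.3f ms total, %.3f us per launch\n",
           launches, ms, 1000.0f * ms / launches);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_points);
    return 0;
}
```

Compiled with `nvcc` and run on the target GPU, this would give a per-launch figure for the small kernels; timing one of the large kernels the same way would show whether the 30-100 small launches are a significant fraction of the total.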