Does any graphics API allow efficient per-primitive branching?

Question

When writing fragment shaders in OpenGL, one can branch on compile-time constants, on uniform variables, or on varying variables.

How well that branching performs depends on the hardware and driver implementation, but branching on a compile-time constant is usually free, and branching on a uniform is generally faster than branching on a varying.

In the case of a varying, the rasterizer still has to interpolate the variable for each fragment, and the branch has to be decided on each fragment shader execution, even if the value of the varying is the same for every fragment in the current primitive.

What I wonder is whether any graphics API or extension allows fragment shader branching that is evaluated only once per rasterized primitive (or, in the case of tiled rendering, once per primitive per bin)?

Branching on varyings is (in my experience) as fast as uniform branching when all threads in the same warp follow the same code path. Performance drops massively when different threads in the same warp take different branches because, due to the SIMD architecture, the different branches are executed one after the other. So what you ask for should be happening in any graphics API on any reasonably new hardware. — BDL, Dec 04 '21 at 11:57
@BDL: Note that whether all fragments in a wavefront come from the same primitive is something that diverges based on hardware. Some implementations do this, and others don't. It can be faster, particularly for very small polygons, to run multiple primitives in the same wavefront. — Nicol Bolas, Dec 04 '21 at 14:23

score 4 · Answer 1 · answered Dec 04 '21 at 15:13

Dynamic branching is only expensive when it causes divergence of instances executing at the same time. The cost of interpolating a "varying" is trivial.
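To make the divergence cost concrete, here is a toy cost model in Python. It is only a sketch: the function name `wavefront_cost`, the 32-lane width and the per-branch costs are all made-up assumptions, not properties of any real GPU. It assumes a wavefront pays for every distinct branch that at least one of its lanes takes.

```python
def wavefront_cost(branch_ids, branch_costs):
    """Toy model: a SIMD wavefront executes every distinct branch taken by
    any of its lanes, one branch after the other, so the costs add up."""
    return sum(branch_costs[b] for b in set(branch_ids))


# Hypothetical instruction counts for the two sides of an if/else.
BRANCH_COSTS = {0: 10, 1: 30}

uniform_wave = [1] * 32           # all 32 lanes take branch 1
divergent_wave = [0] * 31 + [1]   # a single lane takes the other branch

uniform_cost = wavefront_cost(uniform_wave, BRANCH_COSTS)
divergent_cost = wavefront_cost(divergent_wave, BRANCH_COSTS)

print(uniform_cost)    # 30: only branch 1 is executed
print(divergent_cost)  # 40: both branches are executed for the whole wavefront
```

In this model it does not matter whether the branch condition came from a uniform or a varying; only whether the lanes agree.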

Furthermore, different GPUs handle primitive rasterization differently. Some GPUs ensure that wavefronts for fragment shaders only contain instances that are executing on the same primitive. On these GPUs, branching based on values that don't change per-primitive will be fast.

However, other GPUs will pack instances from different primitives into the same wavefronts. On these GPUs, divergence will happen if the value is different for different primitives. How much divergence? It rather depends on how often you get multiple instances in a primitive. If many of your primitives are small in rasterized space, then you'll get a lot more divergence than if you have a lot of large primitives.

GPUs that pack instances from different primitives into a wavefront are trying to maximize how much their cores get utilized. It's a tradeoff: you're minimizing the overall number of wavefronts you have to execute, but a particular cause of divergence (data that is constant within a primitive but not between them) will be penalized.
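The tradeoff between the two strategies can be sketched with another toy model in Python. Everything here is an assumption for illustration: the wavefront width `WAVE_SIZE`, the helper names, and the simplification that fragments are packed strictly back to back (real hardware schedules 2x2 quads and has other constraints). A "mixed" wavefront only actually diverges if the per-primitive value differs between the primitives it contains.

```python
import math

WAVE_SIZE = 32  # hypothetical wavefront width


def waves_per_primitive(fragment_counts):
    """One primitive per wavefront: each primitive is rounded up to whole
    wavefronts, so no wavefront ever mixes primitives."""
    return sum(math.ceil(n / WAVE_SIZE) for n in fragment_counts)


def waves_packed(fragment_counts):
    """Fragments of consecutive primitives are packed back to back.
    Returns (total wavefronts, wavefronts containing several primitives)."""
    lanes = [prim for prim, n in enumerate(fragment_counts) for _ in range(n)]
    waves = [lanes[i:i + WAVE_SIZE] for i in range(0, len(lanes), WAVE_SIZE)]
    mixed = sum(1 for w in waves if len(set(w)) > 1)
    return len(waves), mixed


small = [4] * 64     # 64 tiny primitives, 4 fragments each
large = [1024] * 2   # 2 large primitives, 1024 fragments each

print(waves_per_primitive(small), waves_packed(small))  # 64 (8, 8)
print(waves_per_primitive(large), waves_packed(large))  # 64 (64, 0)
```

With many tiny primitives, packing cuts the wavefront count from 64 to 8 in this model, but every one of those wavefronts mixes primitives. With large primitives, both strategies end up with the same wavefronts and nothing is mixed.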

In any case, try to avoid divergence when you can. But if your algorithm requires it... then your algorithm requires it, and the performance you get is the performance you get. The best you can do is let the GPU know that the "varying" will be constant per-primitive by using flat interpolation.
