GLSL vertex shader performance with early return and branching

Question

I have a vertex shader like this:

void main (){

    vec4 wPos = modelMatrix * vec4( position , 1. );

    vWorldPosition = wPos.xyz;

    float mask = step(
        0.,
        dot(
            cameraDir, 
            normalize(normalMatrix * aNormal)
        )
    );

    gl_PointSize = mask * uPointSize;

    gl_Position = projectionMatrix * viewMatrix * wPos;

}

I'm not entirely sure how to test the performance of the shader while excluding other factors like overdraw. I imagine points of size 1, arranged in a grid in screen space without any overlap, would work?

Otherwise I'm curious about these tweaks:

(removes step, removes a multiplication, introduces if else)

void main (){

    if(dot(
         cameraDir, 
         normalize(normalMatrix * aNormal) //remove step
    ) < 0.) {
        gl_Position = vec4(0.,.0,-2.,.1); 
        gl_PointSize = 0.;
    } else {

        gl_PointSize = uPointSize; //remove a multiplication

        vec4 wPos = modelMatrix * vec4( position , 1. );

        vWorldPosition = wPos.xyz;
        gl_Position = projectionMatrix * viewMatrix * wPos;
    }

}

vs something like this:

void main (){

    if(dot(
         cameraDir, 
         normalize(normalMatrix * aNormal) 
    ) < 0.) {
        gl_Position = vec4(0.,.0,-2.,.1); 
        return;
    }

    gl_PointSize = uPointSize; 

    vec4 wPos = modelMatrix * vec4( position , 1. );

    vWorldPosition = wPos.xyz;

    gl_Position = projectionMatrix * viewMatrix * wPos;

}

Will these shaders behave differently and why/how?

I'm interested in whether there is something that quantifies the difference in performance.

Is there some value, like the number of MADs or something else, that the different code would obviously yield?
Would different generations of GPUs treat these differences differently?
If the step version is guaranteed to be fastest, is there a known list of patterns for avoiding branching, and of which operations to prefer? (For example, would using floor instead of step also be possible?)

float condition = clamp(floor(myDot + 1.),0.,1.); //is it slower?
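
One way to sanity-check whether the floor form is even a drop-in replacement for step is to emulate both on the CPU. The sketch below is JavaScript, not GLSL: `step` and `clamp` are hand-written stand-ins for the GLSL built-ins, `maskStep` and `maskFloor` are hypothetical names, and JavaScript numbers are doubles rather than the 32-bit floats a shader typically uses. It says nothing about speed, only about whether the two masks agree.

```javascript
// Stand-ins for the GLSL built-ins step() and clamp(); these are not WebGL API calls.
const step = (edge, x) => (x < edge ? 0 : 1);
const clamp = (x, lo, hi) => Math.min(Math.max(x, lo), hi);

// The two candidate masks from the question.
const maskStep = (myDot) => step(0, myDot);
const maskFloor = (myDot) => clamp(Math.floor(myDot + 1), 0, 1);

// Typical dot products of unit vectors: both masks agree.
for (const d of [-1, -0.5, -0.001, 0, 0.5, 1]) {
  console.log(d, maskStep(d), maskFloor(d));
}

// A tiny negative value is absorbed by the "+ 1", so the floor form
// reports 1 where step reports 0.
console.log(maskStep(-1e-20), maskFloor(-1e-20)); // 0 1
```

So the two forms match except for negative values small enough to vanish when 1 is added. With 32-bit floats that window is wider than in this double-precision emulation.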

There's the obvious difference that your revised shaders don't set the outputs (gl_PointSize/gl_Position) on all paths, where the original does, so any vertex that hits those cases may produce garbage. – Chris Dodd, May 03 '18 at 01:39
Ahh, I didn't realize `gl_PointSize` matters too. I think I've seen it behave differently in different browsers, sometimes being 0, meaning the spec does not define it? – pailhead, May 03 '18 at 03:51
I will edit that one to make more sense and leave the return one as its own sample? – pailhead, May 03 '18 at 03:51
The second example now does set both outputs, but has less computation in one branch. – pailhead, May 03 '18 at 03:53
The second example still seems to hit the return path in the first `if` without writing `gl_PointSize` or `gl_Position`, so it is still illegal. – solidpixel, May 04 '18 at 09:34

score 1 · Answer 1 · answered May 03 '18 at 01:40

Conditional branches are expensive on GPUs -- generally significantly more expensive than multiplies, so your revised shaders are probably slower.

Chris Dodd

Is it possible to elaborate on what exactly is happening? Do different GPUs handle this case differently depending on the generation or something else? Is there a number of instructions that could be derived from this (x number of MADs, or something else)? – pailhead May 03 '18 at 03:54
I have amended the question. – pailhead May 03 '18 at 04:03
`float result = foo * mask.x + bar * mask.y;` vs `if(mask.x > 0.5) { result = foo; } else { result = bar; }` – pailhead May 03 '18 at 04:05

gman · Accepted Answer · 2018-05-03T07:31:17.403

There are just way too many variables, so the answer is "it depends". Some GPUs can handle branches. Some can't, and the compiler expands the code so that there are no branches, just math that is multiplied by 0 and other math that is not. Then there are things like tiling GPUs that attempt to aggressively avoid overdraw. I'm sure there are other factors.
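
To make the "expanded so that there are no branches" case concrete, here is a rough sketch in JavaScript rather than GLSL. `step` and `mix` are hand-written stand-ins for the GLSL built-ins, and `pointSizeBranch` / `pointSizeFlat` are hypothetical names. It only illustrates that the two forms compute the same value; whether a particular driver actually flattens a branch this way is not something the sketch can tell you.

```javascript
// Stand-ins for the GLSL built-ins step() and mix(); these are not WebGL API calls.
const step = (edge, x) => (x < edge ? 0 : 1);
const mix = (a, b, t) => a * (1 - t) + b * t;

// Branching form: only one side of the condition is evaluated.
function pointSizeBranch(facing, uPointSize) {
  if (facing < 0) {
    return 0;
  }
  return uPointSize;
}

// Flattened form: both sides are evaluated and a 0/1 mask selects one,
// which is the "math that is multiplied by 0" described above.
function pointSizeFlat(facing, uPointSize) {
  const mask = step(0, facing);
  return mix(0, uPointSize, mask);
}

for (const facing of [-1, -0.25, 0, 0.25, 1]) {
  console.log(facing, pointSizeBranch(facing, 8), pointSizeFlat(facing, 8));
}
```

The flattened form always pays for both sides, so it tends to win when the sides are cheap and lose when one side is expensive and rarely taken.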

Theoretically you can run a million or a few million iterations of your shader and time it with

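const pixel = new Uint8Array(4);
// Blocking read of one pixel, so previously queued work is not timed.
gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
const start = performance.now();
// ...draw a bunch...
// Second blocking read, so the timer stops only after the draws have finished.
gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
const end = performance.now();
const elapsedTime = end - start;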

gl.readPixels is a synchronous operation, so it stalls the GPU pipeline. The elapsedTime itself is not the actual time, since it includes starting up the GPU and stopping it, among other things, but it seems like you could compare the elapsedTime from one shader with another to see which is faster.

In other words, if elapsedTime is 10 seconds, it does not mean your shader took ten seconds. It means it took 10 seconds to start the GPU, run your shader, and stop the GPU. How many of those seconds are start, how many are stop, and how many are your shader isn't available. But if elapsedTime for one shader is 10 seconds and 11 for another, then it's probably safe to say one shader is faster than the other. Note you probably want to make your test long enough that you get seconds of difference and not microseconds of difference. You'd also need to test on multiple GPUs to see if the speed differences always hold true.
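
As a rough sketch of how that comparison could be packaged, assuming a WebGL context `gl` and a caller-supplied `draw` function that issues one batch of draw calls with the shader under test. The helper names `timeDraws` and `median` are made up for illustration, and taking the median of several runs is an extra assumption on top of the approach described above, meant to damp run-to-run noise.

```javascript
// Hypothetical helper: times `iterations` calls of `draw` between two blocking reads.
function timeDraws(gl, draw, iterations) {
  const pixel = new Uint8Array(4);
  // Blocking read so previously queued work is not counted.
  gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    draw(gl);
  }
  // Blocking read so the timer stops only after the draws have finished.
  gl.readPixels(0, 0, 1, 1, gl.RGBA, gl.UNSIGNED_BYTE, pixel);
  return performance.now() - start;
}

// Hypothetical helper: median of several elapsed times.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Usage would look something like collecting `timeDraws(gl, drawWithShaderA, n)` and `timeDraws(gl, drawWithShaderB, n)` a handful of times each and comparing the two medians. The numbers still include startup and stall overhead, so only the difference between them is meaningful.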

Note that calling return in the vertex shader does not prevent the vertex from being generated. In fact what gl_Position is in that case is undefined.

edited May 03 '18 at 07:31

answered May 03 '18 at 07:23

gman

relevant link: https://gamedev.stackexchange.com/questions/158173/does-cause-branching-in-glsl – gman May 03 '18 at 15:54
That is a relevant link. I'm getting the feeling that I might be missing a basic understanding of how compilation works, in addition to how GPUs work. Is this a fair statement, and would these gaps inhibit my understanding of this? `"generally more expensive"` <- is still a black box. But I'm consuming all the info, one step at a time :) – pailhead May 03 '18 at 17:59
With WebGL, could one measure this the same way WebGL Stats measures which extensions are available across devices/browsers? Would it be impossible to compile a database of many examples and come up with some ranges? FOO behaves like this, BAR like that? – pailhead May 03 '18 at 18:02
`...draw a bunch` should draw lots of vertices? Comparing 3 matrix multiplications with 2 gives very similar results at 2,000,000 iterations? It's a 1080; would this mean that the draw call overhead is eating most of the 14 seconds it takes to run? Would it be better to instance a very heavy mesh many times? – pailhead May 03 '18 at 18:34
another relevant link: https://stackoverflow.com/questions/16156130/why-is-my-program-so-slow – gman May 04 '18 at 00:17
The point of the second link is mainly to outline the “it depends” aspect? That is, given the unpredictability, benchmark? – pailhead May 04 '18 at 08:58
