Is multiple calls to glDrawElements more efficient than doing the same calculations per-fragment in GLSL?

Question

I'm experimenting with GLSL on iOS and wrote a simple shader that takes a colour value and parameters for two circles (center, radius, and edgeSmoothing). It is drawn using a single quad covering the entire screen. The shader uses gl_FragCoord to determine whether each point is inside or outside the circles: it calculates an alpha of 1.0 inside the circles, falling smoothly to 0.0 beyond radius + edgeSmoothing. It then applies a mirror-style clamp to alpha (a triangle wave, to get an even-odd fill-rule effect) and sets gl_FragColor = mix(vec4(0.0), color, alpha);.

This works fine, but I want 10 circles in 5 different colours. So I call glUniform for all the shader uniforms and glDrawElements to draw the quad five separate times (with different colours and circle parameters each time). My blend mode is additive, so the different colours add up nicely to give the patterns I want. Perfect!

Remember, this is an experiment, so I'm trying to learn about GL and GLSL more than draw the circles.

Now I think it will be much more efficient to draw the quad just once and pass in the parameters for all 10 circles into uniform arrays (centers[10], radii[10], etc.), looping through them in the GLSL and adding up the colours they produce in the shader. So I write this shader and refactor my code to pass in all the circle parameters at once. I get the correct result (the output looks exactly the same) but my frame-rate drops from 15fps to about 3fps - it's five times slower!!

The shader code now has loops, but uses the same maths to calculate the alpha value for each pair of circles. Why is this so much slower? Surely I'm doing less work than filling the whole screen five times and having GL do the additive blending five times (i.e. reading pixel values, blending, and writing back)? Now I'm just calculating the accumulated colour and filling the whole screen once.

Can anyone explain why what I thought would be an optimisation had the opposite effect?

Update: Paste this code into ShaderToy to see what I'm talking about.

#ifdef GL_ES
precision highp float;
#endif

uniform float time; // unused below; the code reads ShaderToy's built-in iGlobalTime instead

void main(void)
{
    float r, d2, a0, a1, a2;
    vec2 pos, mid, offset;
    vec4 bg, fg;

    bg = vec4(.20, .20, .40, 1.0);
    fg = vec4(.90, .50, .10, 1.0);
    mid = vec2(256.0, 192.0);

    // Circle 0
    pos = gl_FragCoord.xy - mid;
    d2 = dot(pos, pos);
    r = 160.0;
    a0 = smoothstep(r * r, (r + 1.0) * (r + 1.0), d2);

    // Circle 1
    offset = vec2(110.0 * sin(iGlobalTime*0.8), 110.0 * cos(iGlobalTime));
    pos = gl_FragCoord.xy - mid + offset;
    d2 = dot(pos, pos);
    r = 80.0;
    a1 = smoothstep(r * r, (r + 1.0) * (r + 1.0), d2);

    // Circle 2
    offset = vec2(100.0 * sin(iGlobalTime*1.1), -100.0 * cos(iGlobalTime*0.7));
    pos = gl_FragCoord.xy - mid + offset;
    d2 = dot(pos, pos);
    r = 80.0;
    a2 = smoothstep(r * r, (r + 1.0) * (r + 1.0), d2);

    // Calculate the final alpha
    float a = a0 + a1 + a2;
    a = abs(mod(a, 2.0) - 1.0);

    gl_FragColor = mix(bg, fg, a);
}

Without seeing your shader I would say that you have dynamic branching issues here and thus can't leverage the complete GPU pipeline. Try unrolling the loop by hand and see if it fixes the issue. — JustSid, Oct 30 '12 at 00:40

score 3 · Accepted Answer · edited May 23 '17 at 12:33

Increasing the complexity of operations in a fragment shader can have a nonlinear effect on rendering time. Even the addition of one simple-looking branching operation can make a shader 10X slower in some cases.

Loops in particular perform very poorly within fragment shaders on iOS devices, so I'd avoid them at all costs. I bet if you unrolled that loop into a series of checks against your uniform values, it would perform better.
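
As a rough illustration of what unrolling could look like, here is a minimal sketch of a fragment shader that handles all five pairs of circles without a loop. The uniform names (centers, radii, colors, edgeSmoothing) and the helper functions are assumptions based on the question's description, not the asker's actual code, and whether this is faster on a given device would need to be measured.

```glsl
// Sketch only: uniform names and the "5 pairs of circles" layout are assumed.
precision highp float;

uniform vec2  centers[10];   // circle centres in window coordinates
uniform float radii[10];     // circle radii in pixels
uniform vec4  colors[5];     // one colour per pair of circles
uniform float edgeSmoothing; // width of the soft edge in pixels (assumed > 0)

// 1.0 inside the circle, falling smoothly to 0.0 at radius + edgeSmoothing.
float circleAlpha(vec2 center, float radius)
{
    vec2 d = gl_FragCoord.xy - center;
    float outer = radius + edgeSmoothing;
    return 1.0 - smoothstep(radius * radius, outer * outer, dot(d, d));
}

// Even-odd combination of two circles: a triangle wave on the summed alpha,
// so a sum of 0 or 2 gives 0.0 and a sum of 1 gives 1.0.
float pairAlpha(vec2 c0, float r0, vec2 c1, float r1)
{
    float a = circleAlpha(c0, r0) + circleAlpha(c1, r1);
    return 1.0 - abs(mod(a, 2.0) - 1.0);
}

void main()
{
    // Every array index is a compile-time constant: no loop, no branch.
    vec4 sum = vec4(0.0);
    sum += colors[0] * pairAlpha(centers[0], radii[0], centers[1], radii[1]);
    sum += colors[1] * pairAlpha(centers[2], radii[2], centers[3], radii[3]);
    sum += colors[2] * pairAlpha(centers[4], radii[4], centers[5], radii[5]);
    sum += colors[3] * pairAlpha(centers[6], radii[6], centers[7], radii[7]);
    sum += colors[4] * pairAlpha(centers[8], radii[8], centers[9], radii[9]);
    gl_FragColor = sum;
}
```

Adding the five contributions in the shader plays the role of the additive blend in the five-draw version, since mix(vec4(0.0), color, alpha) is just color * alpha.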

However, running 10 checks against your uniforms, which sounds like it involves steps or smoothsteps, is going to be very expensive when applied to every pixel in your framebuffer. It's also fairly wasteful, as a huge portion of your screen isn't going to be covered by any particular circle.

There's no need to draw the individual circles using separate glDrawElements() calls, or to do so by drawing screen-sized quads. I describe the process I use to draw sphere impostors in my open source application within this answer, where I can draw thousands of circles (spheres) onscreen at 60 FPS on the latest iOS devices. For that, I pass in a quad for each circle that's just large enough to contain that circle and no larger. These quads are all bunched in an array and drawn at once. Additional parameters for each circle are passed in as attributes alongside the vertex data. For example, I don't need to specify a radius because I use impostor space coordinates from (-1, -1) to (1, 1) alongside the vertices and do simple calculations to determine if a point is within the circle.

If you draw only the fragments required for each circle, and no more, you'll take a lot of the load off of the fragment processing part of the pipeline. You'll still need to enable a blending mode, but the reduction in quad size, combined with the simplification of operations performed in your fragment shader, will lead to much better performance overall.
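
To make the impostor idea concrete, here is a minimal sketch of a vertex and fragment shader pair for per-circle quads. All attribute, uniform, and varying names (position, impostorCoord, circleColor, viewportSize, edgeSmoothing) are hypothetical, and this is not the code from the application mentioned above. It assumes each quad's four corners carry impostor coordinates (-1, -1), (1, -1), (-1, 1), and (1, 1), and that all quads are submitted in one draw call with additive blending enabled.

```glsl
// Vertex shader (sketch; names are assumptions).
attribute vec2 position;      // quad corner in window coordinates (pixels)
attribute vec2 impostorCoord; // (-1,-1) .. (1,1) across the quad
attribute vec4 circleColor;   // per-circle colour, repeated on all 4 corners

uniform vec2 viewportSize;    // framebuffer size in pixels

varying vec2 vImpostorCoord;
varying vec4 vColor;

void main()
{
    vImpostorCoord = impostorCoord;
    vColor = circleColor;
    // Map pixel coordinates to clip space (-1 .. 1 on both axes).
    gl_Position = vec4(position / viewportSize * 2.0 - 1.0, 0.0, 1.0);
}
```

```glsl
// Fragment shader (sketch; names are assumptions).
precision mediump float;

varying vec2 vImpostorCoord;
varying vec4 vColor;

// Soft edge width as a fraction of the radius; assumed to be in (0, 1).
uniform float edgeSmoothing;

void main()
{
    // Squared distance from the circle centre; 1.0 is the circle's edge.
    float d2 = dot(vImpostorCoord, vImpostorCoord);
    float inner = 1.0 - edgeSmoothing;
    // 1.0 well inside the circle, 0.0 at and beyond the edge.
    float alpha = 1.0 - smoothstep(inner * inner, 1.0, d2);
    gl_FragColor = vColor * alpha;
}
```

Because the radius is encoded in the size of the quad itself, the fragment shader needs no per-circle centre or radius, and only fragments inside each quad are shaded at all.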

Thanks for this answer - this and further reading has really helped me understand GLSL a bit more and how branching and looping can kill the parallelisation. I'd love to do it this way, but I don't think GL blending can do the even-odd effect I'm going for (see the shader code in my update) - because everything gets clamped too soon. Is there a way to allow additive effects to go beyond 1.0 and then use my own clamping rule (`a = abs(mod(a, 2.0) - 1.0)`)? — jhabbott, Nov 01 '12 at 01:15
@jhabbott - You could still do an additive blend, with the output value of each single pass of your circles scaled by the inverse of the number of circles you're using (1/10th in this case). You could then apply your modulus as a second pass, taking in the input from the first pass and scaling it back up by the number of circles. You'd lose a little dynamic range (based on the number of circles involved), but it would still be fast. — Brad Larson, Nov 01 '12 at 01:50
I thought of that too, but you don't get the destination buffer data in the fragment shader to read back so how could I access the previously written data? — jhabbott, Nov 01 '12 at 02:14
@jhabbott - You would do the additive blend first for all of your circles, and have the FBO that you're rendering that into be backed by a texture. For the second pass, you'd load from that texture and do your modulus operation on the colors within it. Alternatively, you could use the new iOS 6.0 framebuffer load operations for an even faster read of the previously output color. — Brad Larson, Nov 01 '12 at 02:31
Ahh, I didn't think of using a separate texture as temporary storage, thanks :) — jhabbott, Nov 04 '12 at 11:23
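
The two-pass approach from the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the first pass draws the circles as impostor quads into a texture-backed FBO with additive blending (glBlendFunc(GL_ONE, GL_ONE)), and the names vImpostorCoord, vTexCoord, accumTexture, circleCount, edgeSmoothing, bg, and fg are hypothetical. With an 8-bit render target the 1/N scaling costs some precision, as noted in the comments.

```glsl
// Pass 1 fragment shader (sketch): write each circle's coverage scaled by 1/N.
precision mediump float;

varying vec2 vImpostorCoord;  // (-1,-1) .. (1,1) across the circle's quad
uniform float edgeSmoothing;  // soft edge as a fraction of the radius, in (0, 1)
uniform float circleCount;    // N, e.g. 10.0

void main()
{
    float d2 = dot(vImpostorCoord, vImpostorCoord);
    float inner = 1.0 - edgeSmoothing;
    float coverage = 1.0 - smoothstep(inner * inner, 1.0, d2);
    // Scaling keeps the additive sum of up to N circles within 0..1.
    gl_FragColor = vec4(coverage / circleCount);
}
```

```glsl
// Pass 2 fragment shader (sketch): full-screen quad sampling the pass 1 texture.
precision mediump float;

varying vec2 vTexCoord;          // 0..1 across the screen
uniform sampler2D accumTexture;  // texture backing the pass 1 FBO
uniform float circleCount;       // same N as in pass 1
uniform vec4 bg;                 // background colour
uniform vec4 fg;                 // fill colour

void main()
{
    // Undo the 1/N scaling to recover the summed coverage.
    float a = texture2D(accumTexture, vTexCoord).r * circleCount;
    // Triangle wave: even overlap counts give 0.0, odd counts give 1.0.
    a = 1.0 - abs(mod(a, 2.0) - 1.0);
    gl_FragColor = mix(bg, fg, a);
}
```

The second pass applies the even-odd rule once per pixel, after all circles have been accumulated, which is what fixed-function blending alone cannot do.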
