
I'm making a graphics program that generates models. When the user performs certain actions, the behavior of the shader needs to change. These actions affect not only numeric constants and input data; they affect the number, order, and type of a series of computation steps.

To solve this problem, two solutions came to mind:

  1. Generate the shader code at run time and compile it then (see the sketch after this list). This is CPU heavy, since compilation can take some time, but it is very GPU friendly.
  2. Use some kind of bytecode that a single fixed shader interprets at run time. This removes the need to compile the shader again, but now the GPU has to do a lot of bookkeeping.
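For reference, here is a minimal host-side sketch of option 1 in C, assuming a hypothetical build_source() that turns the current list of user operations into GLSL text; the GL calls are just the standard compile-and-link path:

#include <stdio.h>
#include <GL/glew.h>

/* Hypothetical: writes the generated GLSL for the current model into dst. */
extern void build_source(char *dst, size_t cap);

GLuint rebuild_program(void)
{
    char src[16384];
    build_source(src, sizeof src);
    const char *p = src;

    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(fs, 1, &p, NULL);
    glCompileShader(fs);          /* the expensive, platform-dependent step */

    GLuint prog = glCreateProgram();
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    glDeleteShader(fs);           /* actually freed when the program is */
    return prog;
}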

I developed prototypes for both approaches and the results are more extreme than I expected.

The compilation time depends heavily on the rest of the shader (I guess there is a lot of function inlining going on). I think I could refactor the shader to do less work per thread and improve the compilation time, but I don't know yet whether that will be enough, and I don't like the idea of run-time recompilation very much (it is platform dependent, harder to debug, and more complex).

On the other hand, the bytecode approach runs about 25 times slower (not counting the compilation time of the first approach).

I knew the bytecode approach would be slower, but I didn't expect this much of a gap, particularly after optimizing it.

The interpreter works by reading bytecode from a uniform buffer object. Below is a simplification of it; I placed a "..." wherever the useful (non-bookkeeping) code goes. That part is the same as in the other approach (except that there it is not inside a loop with a big if/else tree to select the proper instruction):

layout (std140, binding=7) uniform shader_data{
    uvec4 code[256];
};

float interpreter(vec3 init){
    float d[4];                 // small stack of intermediate results
    vec3 positions[3];
    int dDepth=0;               // current stack depth
    positions[0]=init;
    for (int i=0; i<int(code[128].x); i+=3){  // code[128].x holds the program length
        const uint instruction=code[i].x;
        const uint ldi=code[i].y;             // operand indices used by the
        const uint sti=code[i].z;             // elided instruction bodies
        if (instruction==MIX){                // MIX first: ~50% of a typical program
            ...
        }else{
            if (instruction<=BOX){
                if (instruction<=TRANSLATION){
                    if (instruction==PARA){
                        ...
                    }else{//TRANSLATION
                        ...
                    }
                }else{
                    if (instruction==EZROT){
                        ...
                    }else{//BOX
                        ...
                    }
                }
            }else{
                if (instruction<=ELLI){
                    if (instruction==CYL){
                        ...
                    }else{//ELLI
                        ...
                    }
                }else{
                    if (instruction==REPETITION){
                        ...
                    }else{//MIRRORING
                        ...
                    }
                }
            }
        }
    }
    return d[0];
}

My question is: do you know why it is so much slower (I don't see that much bookkeeping in the interpreter)? Can you guess what the main performance problems of this interpreter are?

dv1729
    "what are the main performance problems of this interpreter?" [Branches](https://stackoverflow.com/questions/17223640/is-branch-divergence-really-so-bad). – genpfault May 23 '17 at 20:26
  • I know branches incur a performance penalty, but in this case the branching is completely uniform (non-divergent): all threads take the same branch. Do you still think that branches are the main problem? I tried a switch/case and other if/else structures, but this is the fastest version. I even put the MIX instruction first because MIX instructions represent almost 50% of a typical bytecode program. – dv1729 May 23 '17 at 20:31
  • Reading comprehension fail on my part, guess I need some afternoon coffee :) Good call on the uniform branches, other than that I'd guess sheer shader size and/or the shader compiler being unable to optimize as well as it can on a "standard" shader :/ – genpfault May 23 '17 at 20:41

1 Answer


GPUs don't like conditional branching at the best of times. Byte code interpretation is thus one of the worst things you could possibly do on a GPU.

Granted, the principal problem with branching is not so bad in your case, because your "byte code" is all in uniform memory. Even so, it's going to run excessively slowly due to all of those branches.

It would be much better to get a handle on the possibilities of your shader at a high level, then use a very small number of branches to decide what your entire shader's behavior will be. These wouldn't be at the level of byte code; they'd be more like "compute positions with matrix skinning", "compute lighting with this BRDF", or "use a shadow map".

This is the so-called "ubershader" approach: one shader, with a number of large and distinct codepaths that are determined by a few uniform settings.
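As a sketch of that structure (the uniform names and shapes here are made up for illustration), the host program would carry a GLSL chunk like this, where a couple of uniforms set once per draw select the entire codepath:

/* An ubershader fragment, kept as a C string: behavior is chosen by
 * uniform ints set once per draw, not by a per-instruction loop.
 * u_shape and u_mirror are hypothetical names. */
static const char *ubershader_chunk =
    "uniform int u_shape;   // 0 = sphere, 1 = box\n"
    "uniform int u_mirror;  // mirror the domain before evaluating?\n"
    "float scene(vec3 p){\n"
    "    if (u_mirror == 1) p.x = abs(p.x);\n"
    "    if (u_shape == 0)\n"
    "        return length(p) - 1.0;\n"
    "    vec3 q = abs(p) - vec3(1.0);\n"
    "    return length(max(q, 0.0));\n"
    "}\n";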

If you can't do that, then there's really not much you can do outside of recompiling when needed. And that's going to hurt on the CPU; you cannot expect to use the shader the frame you start recompiling it (or for several frames thereafter, in all likelihood). SPIR-V shaders might help recompilation performance, but probably not that much.
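For completeness, a sketch of how a precompiled SPIR-V module would be loaded (OpenGL 4.6 / ARB_gl_spirv), assuming the blob is already in memory; the driver then specializes a binary instead of parsing GLSL:

#include <GL/glew.h>

/* 'blob'/'size': a SPIR-V module produced offline or at run time. */
GLuint load_spirv_fragment(const void *blob, GLsizei size)
{
    GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderBinary(1, &fs, GL_SHADER_BINARY_FORMAT_SPIR_V, blob, size);
    glSpecializeShader(fs, "main", 0, NULL, NULL);  /* entry point, no constants */

    GLuint prog = glCreateProgram();
    glAttachShader(prog, fs);
    glLinkProgram(prog);
    glDeleteShader(fs);
    return prog;
}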

> although having a small delay (~100ms) is not that bad since it is not a game

I say measure the time it takes to do shader compilation. If it's less than 100ms (or whatever you consider to be sufficiently interactive), go with it.

However, be advised that many OpenGL implementations compile shaders on a separate thread. So by the time glLinkProgram returns, the shader may not actually be done. To profile this process accurately, you need to force the compilation to have finished; querying the GL_LINK_STATUS should do the trick.
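A sketch of that measurement, assuming a hypothetical monotonic now_ms() from your platform layer; the GL_LINK_STATUS query is what forces the link to finish before the timer stops:

#include <GL/glew.h>

extern double now_ms(void);   /* hypothetical monotonic clock */

/* Wall-clock cost of a link, forced to completion by the status query. */
double time_link(GLuint prog)
{
    double t0 = now_ms();
    glLinkProgram(prog);

    GLint linked = GL_FALSE;
    glGetProgramiv(prog, GL_LINK_STATUS, &linked);   /* blocks until done */

    return now_ms() - t0;
}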

One more performance trick: do not use glCompileShader and glLinkProgram separately. Instead, use glCreateShaderProgramv. It makes a separable program (containing just one shader stage), but that process will likely be faster than compiling and linking as separate actions.
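A sketch of that path; the one-stage separable program is then combined with the other stages through a program pipeline object:

#include <GL/glew.h>

GLuint make_fragment_program(const char *src)
{
    /* Compiles and links in one call; the program is already separable. */
    return glCreateShaderProgramv(GL_FRAGMENT_SHADER, 1, &src);
}

void bind_with_pipeline(GLuint vert_prog, GLuint frag_prog)
{
    GLuint pipe;
    glGenProgramPipelines(1, &pipe);
    glUseProgramStages(pipe, GL_VERTEX_SHADER_BIT, vert_prog);
    glUseProgramStages(pipe, GL_FRAGMENT_SHADER_BIT, frag_prog);
    glBindProgramPipeline(pipe);
}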

Nicol Bolas
  • "If you can't do that" I can't. And I cannot compile it in the background since the user provokes the recompilation and he will expect/need the result at that moment (although having a small delay (~100ms) is not that bad since it is not a game). "SPIR-V shaders might help" I've been trying to use them recently, but I'm not sure about how they might help in this, why do you think it might help? Are you thinking on using the SPIR-V compiler at run-time or using the SPIR-V compiler to compile the interpreter? – dv1729 May 23 '17 at 20:59
  • I mean that SPIR-V shaders *theoretically* compile faster than GLSL directly. So it might make the shader recompilation option less slow. Also, see the note I added to the answer. – Nicol Bolas May 23 '17 at 21:07
  • You mean... generating the SPIR-V myself instead of generating GLSL? Do you think that is reasonably achievable? – dv1729 May 23 '17 at 21:10
  • @dv1729: You generated byte code just fine, didn't you? Just use it to create SPIR-V. SPIR-V can be a little odd, particularly its static-single-assignment aspect, but it's intended to be the target of a compilation process, and generating this shader is a compilation process. It shouldn't be too hard. But before going through that effort, profile to make sure that 1) GLSL compiles too slowly for you, and 2) SPIR-V will compile fast enough. – Nicol Bolas May 23 '17 at 21:12
  • Thanks! I will look more into the compilation approach (trying to reduce compilation times, profiling, SPIR-V...) – dv1729 May 23 '17 at 21:29