I'm making a graphics program that generates models. When the user performs certain actions, the behavior of the shader needs to change. These actions affect more than numeric constants or input data: they change the number, order and type of a series of computation steps.
To solve this problem, two solutions came to mind:
- Generate the shader code at run-time and compile it then. This is CPU-intensive, since the compilation can take some time, but it is very GPU-friendly.
- Use some kind of bytecode that a single, fixed shader interprets at run-time. This removes the need to recompile the shader, but the GPU now has to take care of a lot of bookkeeping.
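A minimal sketch of what the first option could look like on the host side. The post does not say which language the host program uses, so C++ is an assumption here, and `Step`, `generateShader` and the GLSL helpers `translate`, `box` and `mirror` are hypothetical names used only for illustration. The idea is simply to emit one line of shader source per computation step, in the order the user chose, and hand the resulting string to the shader compiler:

```cpp
#include <string>
#include <vector>

// Hypothetical step kinds; the names echo some of the instructions used
// later in the interpreter, but the real set may differ.
enum class Step { Translation, Box, Mirroring };

// Emit the source of one GLSL function with one line per step, in order.
// translate(), box() and mirror() are assumed to be defined elsewhere in
// the shader; they are placeholders, not real library functions.
std::string generateShader(const std::vector<Step>& steps) {
    std::string src = "float model(vec3 p){\n    float d = 1e20;\n";
    for (Step s : steps) {
        switch (s) {
            case Step::Translation: src += "    p = translate(p);\n";   break;
            case Step::Box:         src += "    d = min(d, box(p));\n"; break;
            case Step::Mirroring:   src += "    p = mirror(p);\n";      break;
        }
    }
    src += "    return d;\n}\n";
    return src;
}
```

Calling `generateShader({Step::Translation, Step::Box})` yields a function body with no loop and no branching on the instruction type, which is why this option is cheap on the GPU; the cost moves to the recompilation that follows every change.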
I developed prototypes for both approaches and the results are more extreme than I expected.
The compilation time depends heavily on the rest of the shader (I guess there is a lot of function inlining). I think I could refactor the shader to do less work per thread and improve the compilation time, but I don't know yet whether that will be enough, and I don't much like the idea of run-time recompilation (very platform-dependent, harder to debug, more complex).
On the other hand, the bytecode approach runs 25 times slower (not counting the compilation time of the first approach).
I knew the bytecode approach was going to be slower, but I didn't expect it to be this much slower, particularly after optimizing it.
The interpreter works by reading bytecode from a uniform buffer object. Below is a simplified version of it. I placed a "..." where the useful (non-bookkeeping) code goes; that part is the same as in the other approach (where, obviously, it is not inside a loop with a big if/else tree to select the proper instruction):
layout (std140, binding=7) uniform shader_data{
    uvec4 code[256];
};
float interpreter(vec3 init){
    float d[4];
    vec3 positions[3];
    int dDepth=0;
    positions[0]=init;
    for (int i=0; i<code[128].x; i+=3){
        const uint instruction=code[i].x;
        const uint ldi=code[i].y;
        const uint sti=code[i].z;
        if (instruction==MIX){
            ...
        }else{
            if (instruction<=BOX){
                if (instruction<=TRANSLATION){
                    if (instruction==PARA){
                        ...
                    }else{//TRANSLATION
                        ...
                    }
                }else{
                    if (instruction==EZROT){
                        ...
                    }else{//BOX
                        ...
                    }
                }
            }else{
                if (instruction<=ELLI){
                    if (instruction==CYL){
                        ...
                    }else{//ELLI
                        ...
                    }
                }else{
                    if (instruction==REPETITION){
                        ...
                    }else{//MIRRORING
                        ...
                    }
                }
            }
        }
    }
    return d[0];
}
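For reference, a sketch of how the host could fill the buffer that this loop reads. It assumes a C++ host (the host language is not stated above) and mirrors the simplified shader literally: an instruction sits at every third `uvec4`, with its opcode and the `ldi` and `sti` values in `.x`, `.y` and `.z`, and the loop bound is stored in `code[128].x`. `Instruction`, `CodeBuffer` and `packProgram` are hypothetical names, and the real layout may differ from this simplification:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// One instruction as the shader reads it from code[i].x, .y and .z.
struct Instruction {
    std::uint32_t opcode; // numeric value of MIX, BOX, ... (values not shown above)
    std::uint32_t ldi;    // value read from code[i].y
    std::uint32_t sti;    // value read from code[i].z
};

using UVec4 = std::array<std::uint32_t, 4>;
// Under std140 an array of uvec4 has a 16-byte stride, so this matches
// "uvec4 code[256]" byte for byte.
using CodeBuffer = std::array<UVec4, 256>;

CodeBuffer packProgram(const std::vector<Instruction>& program) {
    constexpr std::size_t kStride = 3;       // the shader loop advances i by 3
    constexpr std::size_t kLengthSlot = 128; // the loop bound lives in code[128].x
    // The last instruction must land below the slot that holds the length.
    if (!program.empty() && (program.size() - 1) * kStride >= kLengthSlot)
        throw std::length_error("program does not fit below the length slot");

    CodeBuffer buffer{}; // zero-initialised
    for (std::size_t k = 0; k < program.size(); ++k)
        buffer[k * kStride] = {program[k].opcode, program[k].ldi, program[k].sti, 0u};
    buffer[kLengthSlot][0] = static_cast<std::uint32_t>(program.size() * kStride);
    return buffer;
}
```

The returned array can be uploaded as-is to the uniform buffer bound at binding point 7. Changing the model then only means rewriting these 4 KiB, with no shader recompilation.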
My question is: why is it so much slower, given that I don't see that much bookkeeping in the interpreter? Can you guess what the main performance problems of this interpreter are?