
I am working on an automatic OpenGL batching method in my game engine, to reduce draw calls and redundant state-changing calls.

My batch tree design begins with the most expensive state at the root and adds leaves down for each less expensive state.

Example: Tree root: shaders/programs. Sibling level below: blend states, and so on.
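
To make the design concrete, here is a minimal sketch of such a tree in TypeScript (matching the WebGL implementation mentioned further down; all names are hypothetical, and only three state levels are shown):

```typescript
// Minimal sketch of the batch tree: each level groups draw calls by one
// state, ordered from most to least expensive, so the expensive switches
// happen least often. All names here are hypothetical.
interface DrawCall {
  program: WebGLProgram;
  texture: WebGLTexture;
  blendEnabled: boolean;
  draw(gl: WebGLRenderingContext): void;
}

// program -> texture -> blend flag -> draw calls
type BatchTree = Map<WebGLProgram, Map<WebGLTexture, Map<boolean, DrawCall[]>>>;

function insert(tree: BatchTree, call: DrawCall): void {
  let byTexture = tree.get(call.program);
  if (!byTexture) tree.set(call.program, byTexture = new Map());
  let byBlend = byTexture.get(call.texture);
  if (!byBlend) byTexture.set(call.texture, byBlend = new Map());
  let calls = byBlend.get(call.blendEnabled);
  if (!calls) byBlend.set(call.blendEnabled, calls = []);
  calls.push(call);
}
```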

So my question is: which of the calls in this list are most likely the most expensive?

  • binding program
  • binding textures
  • binding buffers
  • buffering texture and vertex data
  • binding render targets
  • glEnable / glDisable
  • blend state equation, color, functions, colorWriteMask
  • depth stencil state depthFunc, stencilOperations, stencilFunction, writeMasks

Also, I am wondering which method will be faster:

  • Collect all batchable draw commands into a single vertex buffer and issue only one draw call (this method would also force matrix transforms to be applied per vertex on the CPU side; see the sketch below)
  • Do not batch at all and render many small draw calls, and only batch the particle system, etc.
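
A hedged sketch of the first method, assuming 2D positions and a column-major 3×3 transform per object (all names are hypothetical):

```typescript
// Sketch of method 1: pre-transform vertices on the CPU and merge
// everything into one buffer, so a single draw call suffices.
// Assumes 2D positions, a per-object column-major 3x3 transform,
// and that a VBO is already bound with its attributes set up.
function batchTransformed(
  gl: WebGLRenderingContext,
  objects: { positions: Float32Array; matrix: Float32Array }[]
): void {
  let total = 0;
  for (const o of objects) total += o.positions.length;
  const merged = new Float32Array(total);
  let offset = 0;
  for (const o of objects) {
    const m = o.matrix;
    for (let i = 0; i < o.positions.length; i += 2) {
      const x = o.positions[i], y = o.positions[i + 1];
      merged[offset + i]     = m[0] * x + m[3] * y + m[6]; // transformed x
      merged[offset + i + 1] = m[1] * x + m[4] * y + m[7]; // transformed y
    }
    offset += o.positions.length;
  }
  gl.bufferData(gl.ARRAY_BUFFER, merged, gl.DYNAMIC_DRAW);
  gl.drawArrays(gl.TRIANGLES, 0, total / 2); // 2 floats per vertex
}
```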

PS: Render targets will always be changed before or after the batch pass, depending on usage.

Progress so far:

  • Andon M. Coleman: cheapest are uniform & vertex array bindings; most expensive are FBO and texture bindings
  • datenwolf: program changes invalidate the GPU's code cache

Resulting cost ranking (most expensive first):

1: Framebuffer states
2: Program
3: Texture Binding
...
N: Vertex Array binding, Uniform binding

Current execution tree in WebGL:

  • Program
  • Attribute Pointers
  • Texture
  • Blend State
  • Depth State
  • Stencil Front / Back State
  • Rasterizer State
  • Sampler State
  • Bind Buffer
  • Draw Arrays

Each step is a sibling hash tree, to avoid checking against a state cache inside of the main render queue (see the traversal sketch below).
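
For illustration, here is a sketch of that traversal (reusing the hypothetical BatchTree and DrawCall types from the sketch near the top; only three of the levels above are shown):

```typescript
// Sketch of walking the batch tree in state order: a gl call is only
// issued when entering a node whose key differs from the currently
// bound one, so no per-draw state-cache lookups are needed in the
// main render queue. Reuses the hypothetical BatchTree/DrawCall types.
let currentProgram: WebGLProgram | null = null;
let currentTexture: WebGLTexture | null = null;

function execute(gl: WebGLRenderingContext, tree: BatchTree): void {
  for (const [program, byTexture] of tree) {
    if (program !== currentProgram) {
      gl.useProgram(program);
      currentProgram = program;
    }
    for (const [texture, byBlend] of byTexture) {
      if (texture !== currentTexture) {
        gl.bindTexture(gl.TEXTURE_2D, texture);
        currentTexture = texture;
      }
      for (const [blendEnabled, calls] of byBlend) {
        if (blendEnabled) gl.enable(gl.BLEND);
        else gl.disable(gl.BLEND);
        for (const call of calls) call.draw(gl);
      }
    }
  }
}
```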

Loading textures/programs/shaders/buffers happens before rendering, in an extra queue, both for future multithreading and to be sure that the context is initialized before anything is done with it (see the sketch below).
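
A minimal sketch of that load queue (hypothetical names): tasks are recorded immediately but executed only once the context exists:

```typescript
// Sketch of the preload idea: resource creation is queued and only
// flushed once the GL context exists, so no call can hit an
// uninitialized context. All names are hypothetical.
type GLTask = (gl: WebGLRenderingContext) => void;
const loadQueue: GLTask[] = [];

function queueLoad(task: GLTask): void {
  loadQueue.push(task);
}

function flushLoads(gl: WebGLRenderingContext): void {
  // Runs after context creation and before the first frame;
  // a worker could fill the queue in parallel later on.
  for (const task of loadQueue) task(gl);
  loadQueue.length = 0;
}

// Usage: safe even before GL init, the texture upload is deferred.
queueLoad((gl) => {
  const tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, tex);
});
```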

The biggest problem with self-rendering objects is that you cannot control when something happens; for example, if a developer calls these methods before GL is initialized, he would run into bugs or problems without knowing why...

Zeto
  • What is your target GL version? Your list is missing render target (FBO) state changes, which are *very* expensive (but not necessarily an issue depending on version). – Andon M. Coleman Aug 26 '14 at 12:49
  • Thanks for the tip, I wanted to introduce FBO control soon after getting the base architecture to work. – Zeto Aug 26 '14 at 13:33
  • I'm interested in any version, but mostly OpenGL ES 2 and 3. Btw., just include the version you know something about :) – Zeto Aug 26 '14 at 13:35
  • I can tell you that vertex array and uniform states are the cheapest states you can change on all GL implementations, and FBO states and texture binding tend to be the most expensive. I could not really give you a bullet list like in your question, though, just the extreme ends of this list. I would also point out that a lot of the expense is not the initial call to `glBindXXX (...)` for instance, but what happens when you make a draw call after 20-odd states are changed and they all have to be validated at once. – Andon M. Coleman Aug 26 '14 at 14:05
  • What do you think is more expensive: (texture vs. program) and (texture vs. blend/depth/stencil/rasterizer/sampler states)? – Zeto Aug 26 '14 at 14:45
  • Definitely programs, because a program change always comes with code cache invalidation. So after changing the program, the GPU has to start with a cold execution cache. Sampler states behave much more like uniforms, but are not quite as cheap. It should also be pointed out that the ordering of the expenses depends on the driver and the GPU being used, but also on the program the GPU currently runs. As a rule of thumb, anything that makes the cache cold is a major performance killer. It's hard to overemphasize how important cache coherence and access patterns are for GPU performance. – datenwolf Aug 26 '14 at 16:19
  • Just to give an example: when we implemented the system described in the paper http://dx.doi.org/10.1364/BOE.5.002963, at one point I could achieve a 100× increase in throughput by an unsuspected reordering of data accesses, and another 10× increase by improving the data alignment. While this was all CUDA work and not OpenGL work (that's where I had those performance boosts), it shines a very strong light on how delicate GPUs are when it comes to their data access patterns. – datenwolf Aug 26 '14 at 16:22
  • Thanks for the paper. Well, my main aim is to create a nice high-level batching method, so CUDA or OpenGL doesn't make any difference for me ^^ but sure, I know that the implementations can vary ^^ – Zeto Aug 26 '14 at 19:28

2 Answers


The relative costs of such operations will of course depend on the usage pattern and your general scenario. But you might find Nvidia's "Beyond Porting" presentation slides a useful guide. Let me reproduce slide 48 in particular here:

Relative Cost of state changes

  • In decreasing cost...
  • Render Target ~60K/s
  • Program ~300K/s
  • ROP
  • Texture Bindings ~1.5M/s
  • Vertex Format
  • UBO Bindings
  • Uniform Updates ~10M/s

This does not directly match all of the bullet points of your list. E.g., glEnable/glDisable might affect anything. Also, GL's buffer bindings are nothing the GPU directly sees: buffer bindings are mainly client-side state, depending on the target, of course. A change of blend state would be a ROP state change, and so on.
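
If you want to exploit this ordering in a render queue, one common pattern (not from the slides; a hypothetical sketch) is to pack the cost-ordered state IDs into a single sort key, so that sorting the queue minimizes the most expensive switches:

```typescript
// Sketch: turn the slide's cost ordering into a sort key, so a sorted
// render queue changes the most expensive state least often.
// Field widths and all names are hypothetical.
interface DrawCommand {
  renderTargetId: number; // most expensive to switch (~60K/s)
  programId: number;      // ~300K/s
  textureId: number;      // ~1.5M/s
  // uniform updates are cheap (~10M/s) and need not be part of the key
}

function sortKey(cmd: DrawCommand): number {
  // pack the most expensive state into the most significant bits
  return (cmd.renderTargetId << 20) | (cmd.programId << 10) | cmd.textureId;
}

function sortQueue(queue: DrawCommand[]): void {
  queue.sort((a, b) => sortKey(a) - sortKey(b));
}
```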

derhass
  • Thanks for the slide, I will check it out soon. There is also another thing I'm wondering about: if the CPU & GPU share the same memory, is the memory copied over to a new address, or do they just share pointers to it? An example scenario would be the iOS device architecture. – Zeto Aug 26 '14 at 19:24
  • If I measure execution time, will this only be on the CPU side, or would it also count the GPU time? Maybe using [glFinish](https://www.khronos.org/opengles/sdk/docs/man/xhtml/glFinish.xml)? – Zeto Aug 26 '14 at 19:42
  • @Zeto: That memory stuff is all completely implementation-dependent. I don't know anything specific about the iOS architecture. For the timing: you only measure the CPU time. On desktop GL, there is the [GL_ARB_timer_query extension](https://www.opengl.org/registry/specs/ARB/timer_query.txt) which allows measuring the GPU timing. Dunno if there is something similar for GLES. There might also be external GL(ES) debugging and profiling tools and platform-specific APIs for performance counters. – derhass Aug 26 '14 at 20:03
  • Thanks for the tip. Yes, I do have "Instruments" from Apple, which also hints at redundancy for me. I have to check if I can measure single-command execution times; maybe I can create the list above in this way... – Zeto Aug 26 '14 at 20:56

This tends to be highly platform/vendor dependent. Any numbers you may find apply to a specific GPU, platform and driver version. And there are a lot of myths floating around on the internet about this topic. If you really want to know, you need to write some benchmarks, and run them across a range of platforms.
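
For instance, a minimal sketch of such a micro-benchmark (WebGL flavor, hypothetical names; note that without a timer-query extension it measures CPU-side submission cost only):

```typescript
// Minimal sketch of a state-change micro-benchmark. Caveat:
// performance.now() around GL calls measures only CPU-side submission
// cost; GPU time needs a timer-query extension or a platform profiler.
// Assumes a trivial triangle and its attributes are already set up.
function benchProgramSwitch(
  gl: WebGLRenderingContext,
  a: WebGLProgram,
  b: WebGLProgram,
  iterations: number
): number {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    gl.useProgram(i % 2 === 0 ? a : b); // force a real switch each time
    gl.drawArrays(gl.TRIANGLES, 0, 3);  // a draw makes the state validate
  }
  gl.finish(); // drain the pipeline so queued work is included
  return (performance.now() - start) / iterations; // ms per switch
}
```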

With all these caveats:

  • Render target (FBO) switching tends to be quite expensive. Highly platform and architecture dependent, though. For example if you have some form of tile based architecture, pending rendering that would ideally be deferred until the end of the frame may have to be completed and flushed out. Or on more "classic" architectures, there might be compressed color buffers or buffers used for early depth testing that need consideration when render targets are switched.

  • Updating texture or buffer data is impossible to evaluate in general terms. It obviously depends heavily on how much data is being updated. Contrary to some claims on the internet, calls like glBufferSubData() and glTexSubImage2D() do not typically cause a synchronization. But they do involve data copies (see the sketch after this list).

  • Binding programs should not be terribly expensive, but it is typically still more heavyweight than the state changes below.

  • Texture binding is mostly relatively cheap. But it really depends on the circumstances. For example, if you use a GPU that has VRAM, and the texture is not in VRAM at the moment, it might trigger a copy of the texture data from system memory to VRAM.

  • Uniform updates. This is supposedly very fast on some platforms. But it's actually moderately expensive on others. So there's a lot of variability here.

  • Vertex state setup (including VBO and VAO binding) is typically fast. It has to be, because it's done so frequently by most apps that it can very quickly become a bottleneck. But there are similar considerations as for textures, where buffer memory may have to be copied or mapped if it was not used recently.

  • General state updates, like blend states, stencil state, or write masks, are generally very fast. But there can be very substantial exceptions.
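
To make the buffer/texture update bullet concrete, here is what those calls look like in WebGL terms (a hedged sketch; sizes and names are hypothetical):

```typescript
// Sketch of partial data updates (see the buffer/texture bullet above):
// these calls copy the client data into the GL-owned store and return;
// they do not typically wait for the GPU.
function updateDynamicData(gl: WebGLRenderingContext,
                           vbo: WebGLBuffer, tex: WebGLTexture): void {
  // overwrite the first 6 floats of an existing, big-enough VBO
  gl.bindBuffer(gl.ARRAY_BUFFER, vbo);
  gl.bufferSubData(gl.ARRAY_BUFFER, 0, new Float32Array([0, 0, 1, 0, 1, 1]));

  // overwrite a 2x2 pixel region of an existing RGBA texture
  gl.bindTexture(gl.TEXTURE_2D, tex);
  const pixels = new Uint8Array(2 * 2 * 4); // 4 bytes per RGBA pixel
  gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, 2, 2,
                   gl.RGBA, gl.UNSIGNED_BYTE, pixels);
}
```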

Just a typical example of why characteristics can be so different between architectures: changing blend state might mean sending a couple of command words on one architecture, with minimal overhead. On other architectures, blending is done as part of the fragment shader, so if you change blend state, the shader program has to be modified to patch in the code for the new blending calculation.

Reto Koradi
  • Thanks. Well, I do always use my own fragment & vertex shaders. But I still offer an external Effect class which can be used by a developer to change the fragment shaders; I inject my engine code into the source file to guarantee functionality. – Zeto Aug 27 '14 at 15:20
  • I now have a WebGL version running, done in JS because the development is pretty fast... I'll start stress testing this afternoon & change the batch tree priorities; maybe I will get some useful data that way about which ordering turns out slower. – Zeto Aug 27 '14 at 15:22
  • What would you prefer for a benchmark? Btw., everyone says that I can only measure the CPU timing but not the GPU one, because of the client <> server architecture on some devices. – Zeto Aug 27 '14 at 15:24