
This subject, as with any optimisation problem, gets covered a lot, but I just couldn't find what I (think) I want.

A lot of tutorials, and even SO questions, have similar tips, generally covering:

  • Use GL face culling (the OpenGL function, not the scene logic)
  • Only send 1 matrix to the GPU (the combined projection-model-view), thereby decreasing the MVP calculation from per vertex to once per model (as it should be).
  • Use interleaved Vertices
  • Minimize as many GL calls as possible, batch where appropriate

And possibly a few/many others. I am (for curiosity reasons) rendering 28 million triangles in my application using several vertex buffers. I have tried all the above techniques (to the best of my knowledge), and received almost no performance change.

Whilst I am getting around 40 FPS in my implementation, which is by no means problematic, I am still curious as to where these optimisation 'tips' actually come into play.

My CPU idles at around 20-50% during rendering, so I assume I am GPU bound.

Note: I am looking into gDEBugger at the moment

Cross posted at Game Development

hiddensunset4

4 Answers


Point 1 is obvious, as it saves fill rate. If the primitives of an object's back side would otherwise get processed first, culling omits those faces before rasterization. However, modern GPUs tolerate overdraw quite well. I once measured (on a GeForce 8800 GTX) up to 20% overdraw before a significant performance hit. But it's better to save this reserve for things like occlusion culling, rendering of blended geometry and the like.

Point 2 is, well, pointless. The matrices have never been calculated on the GPU (well, unless you count the SGI Onyx). Matrices have always been a kind of global rendering parameter, calculated on the CPU and then pushed into global registers on the GPU, now called uniforms, so joining them brings only very little benefit. In the shader it saves just one additional matrix-vector multiplication (which boils down to 4 MAD instructions), at the expense of algorithmic flexibility.

Point 3 is all about cache efficiency: data that belongs together should fit into a single cache line.

Point 4 is about preventing state changes from thrashing the caches. But it strongly depends on which GL calls they mean. Changing uniforms is cheap. Switching a texture is expensive. The reason is that a uniform sits in a register, not in some piece of memory that's cached. Switching a shader is expensive, because different shaders exhibit different runtime behaviour, thus thrashing the pipeline's execution prediction, altering memory (and thus cache) access patterns, and so on.

But those are all micro-optimizations (some of them with huge impact). However, I recommend looking into large-impact optimizations, like implementing an early-Z pass, and using occlusion queries in the early-Z pass for quick rejection of whole geometry batches. One large-impact optimization, which essentially sums up a lot of point-4-style micro-optimizations, is to sort render batches by expensive GL state: group everything with common shaders, within those groups sort by texture, and so on. This state grouping only affects the visible render passes. In the early-Z pass you're only testing outcomes against the Z buffer, so there's only geometry transformation, and the fragment shaders just pass through the Z value.

datenwolf
  • Very nice answer. A question though: in your response to point 2, I'm a little confused. I was comparing the difference between having the "model * projection * view" inside the shader (as uniform variables, modelview sent each time a model changes), versus a single uniform matrix variable (modelviewprojection) updated per model, which is calculated (once) by the CPU instead of per vertex. Surely that would save many calculations? – hiddensunset4 Mar 09 '11 at 11:25
  • @Daniel: You normally don't compute the MVP matrix in the shader. What you do is first perform the calculation modelview_position = MV * vertex_position, and then clip_position = P * modelview_position. The reasoning behind this is that for some algorithms you need the modelview-transformed vertex position in between, not just the final result of the whole projection process. Also, vertex normals are transformed only by the inverse transpose of MV, not the full (MVP)^T^-1, so that's another reason: if you want to implement nice lighting you need those transformed normals. – datenwolf Mar 09 '11 at 11:33
  • @Daniel: Yes, you normally supply MV^T^-1 in a separate uniform, but sometimes you just need that unprojected MV, too. And since you don't have to carry out the full matrix-matrix multiplication (16 vector MADs), but only two matrix-vector multiplications (8 vector MADs), it's not that bad. – datenwolf Mar 09 '11 at 11:37
  1. Yes.
  2. Makes no sense, as the driver can combine these matrices for you (it knows they are uniforms, so they will not change during the draw call).
  3. Yes.
  4. Only if you are CPU bound.

The first thing you need to know is where exactly your bottleneck is. "The GPU" is not an answer, because it's a complex system. The actual problem might be among these:

  • Shader processing (vertex/fragment/geometry)
  • Fill rate
  • Number of draw calls
  • GPU <-> VMEM (that's where interleaving and smaller textures help)
  • System bus (streaming some data every frame?)

You need to perform a series of tests to identify the problem. For example, draw everything to a bigger FBO to see if it's a fill-rate problem (or increase the MSAA amount). Or draw everything twice to check for draw-call overload issues.

kvark
  • Can you explain a bit more why you say that batching should be done only if the app is CPU bound? – ashishsony Feb 13 '13 at 11:18
  • (The original answer was given 2.5 years ago, so I'm trying to recall what I was thinking...) On the GPU side there is little difference between a single call and two halves of it. It's the preparation of the call on the driver side that takes a hit, and that is done on the CPU. – kvark Nov 14 '13 at 15:19

Just to add my 2 cents to @kvark's and @datenwolf's answers, I'd like to say that, while the points you mention are 'basic' GPU performance tips, more involved optimization is very application dependent.

In your geometry-heavy test case, you're already pushing 28 million triangles * 40 FPS = 1120 million triangles per second. That is already quite a lot: most GPUs out there (not all; Fermi is a notable exception) have a triangle setup performance of 1 triangle per GPU clock cycle. Meaning that a GPU running at, say, 800 MHz cannot process more than 800 million triangles per second, and that's without even drawing a single pixel. NVIDIA Fermi can process 4 triangles per clock cycle.

If you're hitting this limit (you don't mention your hardware platform), there's not much you can do at the OpenGL/GPU level. All you can do is send less geometry, via more efficient culling (frustum or occlusion), or via a LOD scheme.

Another thing is that tiny triangles hurt fill rate, as rasterizers do parallel processing on square blocks of pixels; see http://www.geeks3d.com/20101201/amd-graphics-blog-tessellation-for-all/.

rotoglup
  • Interesting link, but it could have been summed up in a 'bang for buck with triangles and pixels' statement. And it still relates mainly to LODs and other slightly different optimisations. Nice answer though; I did not indicate my hardware specifications, as I was not looking for hardware-specific tips. – hiddensunset4 Mar 11 '11 at 10:02

This very much depends on what particular hardware you are running and what the usage scenarios are. OpenGL performance tips make sense for the general case - the library is, after all, an abstraction over many different driver implementations. The driver makers are free to optimize however they want under the hood so they may remove redundant state changes or perform other optimizations without your knowledge. On another device, they may not. It is best to stick to best practices to have a better chance of good performance over a range of devices.

Luther
  • Well, I guess this is best looked at less as optimisations specific to OpenGL, and more as good (and performance-rewarding) habits for graphics programming. – hiddensunset4 Mar 09 '11 at 11:46
  • Some general rules of thumb for optimal use of current hardware-accelerated graphics libraries would be: don't change state too often, and batch, batch, batch. Rules of optimization are, however, not set in stone over different generations of hardware, and what is true today wasn't true of all past hardware and may not be true of future hardware. Always have an appreciation of the cache and the limitations and strengths of the hardware you're working on. – Luther Mar 09 '11 at 11:55
  • The wisdom I've heard is that optimizing for your specific hardware is a fool's game, because the behaviour can radically change between hardware generations or even between driver versions. You're better off optimizing for the API (in this case, minimal state changes, as has been said) and letting the hardware catch up where you can't optimize any more. – Jherico Dec 27 '12 at 07:27