
Which is preferable from an efficiency point of view (or from another point of view, if it matters)?

Situation
An OpenGL application that draws many lines at different positions every frame (60 fps). Let's say there are 10 lines. Or 100 000 lines. Would the answer be different?

  • #1 Have a static VBO that never changes, containing 2 vertices of a line

Every frame would have one glDrawArrays call per line to draw, and in between there would be matrix transformations to position our one line

  • #2 Update the VBO with the data for all the lines every frame

Every frame would have a single draw call

mk12

1 Answer


The second is far more efficient.

Changing state, particularly transformation matrices, tends to cause recalculation of other state and generally more math.

Updating geometry, however, simply involves overwriting a buffer.

Modern video hardware sits on a very high-bandwidth bus, so sending a few floats across is trivial. The hardware is designed to move large amounts of data quickly, and updating vertex buffers is exactly the kind of thing it does often and fast. If we assume points of 32 bytes each (float4 position and color), 100000 line segments come to roughly 6 MB (6.4 MB at two points per segment), and PCIe 2.0 x16 is about 8 GB/s, I believe.
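The arithmetic above can be sketched as a few compile-time helpers. This is only a back-of-the-envelope model: the function names are made up for illustration, and it assumes two vertices per segment, the 32-byte vertex from the paragraph above, and the quoted peak bus figure rather than a measured one.

```cpp
#include <cstdint>

// Bytes uploaded per frame, assuming two vertices per line segment.
constexpr std::uint64_t bytesPerFrame(std::uint64_t segments,
                                      std::uint64_t bytesPerVertex) {
    return segments * 2 * bytesPerVertex;
}

// Bytes uploaded per second at a given frame rate.
constexpr std::uint64_t bytesPerSecond(std::uint64_t segments,
                                       std::uint64_t bytesPerVertex,
                                       std::uint64_t fps) {
    return bytesPerFrame(segments, bytesPerVertex) * fps;
}

// Share of the bus used, in percent. busBytesPerSec is a theoretical
// peak (for example 8e9 for the PCIe 2.0 x16 figure quoted above).
constexpr double busSharePercent(std::uint64_t uploadBytesPerSec,
                                 double busBytesPerSec) {
    return 100.0 * static_cast<double>(uploadBytesPerSec) / busBytesPerSec;
}
```

Under these assumptions, 100000 segments at 60 fps come to 384 MB/s, a little under 5% of the quoted 8 GB/s. Real throughput is usually lower than the theoretical peak, so treat this as a rough bound.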

In some cases, depending on how the driver or card handles transforms, changing one may trigger matrix multiplications and recalculation of other values, including transforms, culling and clipping planes, etc. This isn't a problem if you change the state, draw a few thousand polys, and repeat, but when state changes are frequent, they carry a significant cost.

This problem already has a well-known solution in batching: minimizing state changes so that more geometry can be drawn between them. It is the standard way to draw large amounts of geometry efficiently.

As a very clear example, consider the best case for #1: transform set triggers no additional calculation and the driver buffers zealously and perfectly. To draw 100000 lines, you need:

  • 100000 matrix sets (in system RAM)
  • 100000 matrix set calls with function call overhead (to video driver, copying the matrix to the buffer there)
  • 100000 matrices copied to video RAM, performed in a single lump
  • 100000 line draw calls

The function call overhead alone is going to kill performance.

On the other hand, batching involves:

  • 100000 point calculations and sets, in system RAM
  • 1 vbo copy to video RAM. This will be a large chunk, but a single contiguous chunk and both sides know what to expect. It can be handled well.
  • 1 matrix set call
  • 1 matrix copy to video RAM
  • 1 draw call
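To put rough numbers on the two lists above, here is a small sketch that counts driver calls and bytes copied per frame. The names are hypothetical. It assumes a 4x4 float matrix (64 bytes) per matrix set, two vertices per line, and it counts the VBO upload as a driver call of its own.

```cpp
#include <cstdint>

// Assumption: a 4x4 matrix of 4-byte floats, i.e. 64 bytes.
constexpr std::uint64_t kMatrixBytes = 16 * sizeof(float);

struct FrameCost {
    std::uint64_t driverCalls;  // calls into the driver per frame
    std::uint64_t bytesCopied;  // bytes handed to the driver per frame
};

// Approach #1: one matrix set and one draw call per line.
constexpr FrameCost perLineCost(std::uint64_t lines) {
    return FrameCost{2 * lines, lines * kMatrixBytes};
}

// Approach #2: one buffer upload, one matrix set, one draw call in total.
constexpr FrameCost batchedCost(std::uint64_t lines,
                                std::uint64_t bytesPerVertex) {
    return FrameCost{3, lines * 2 * bytesPerVertex + kMatrixBytes};
}
```

With 100000 lines, `perLineCost` gives 200000 calls and 6.4 MB of matrices. `batchedCost` gives 3 calls and about 6.4 MB with 32-byte vertices, or about 1.6 MB if each vertex is just two floats (8 bytes).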

You do copy more data, but there's a good chance the VBO contents still aren't as expensive as copying the matrix data. Plus, you save a huge amount of CPU time in function calls (200000 down to 2). This simplifies life for you, the driver (which has to buffer everything and check for redundant calls and optimize and handle downloading) and probably the video card as well (which may have had to recalculate). To make it really clear, visualize simple code for it:

1:

for (i = 0; i < 100000; ++i)
{
    matrix = calcMatrix(i);   // one matrix per line
    setMatrix(matrix);        // state change for every line
    drawLines(1, vbo);        // one draw call per line
}

(now unroll that loop)

2:

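matrix = calcMatrix();        // one shared matrix
setMatrix(matrix);            // single state change
for (i = 0; i < 100000; ++i)
{
    localVBO[i] = point[i];   // plain writes to system RAM
}
setVBO(localVBO);             // single upload
drawLines(100000, vbo);       // single draw call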
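As a slightly more concrete version of the second snippet, the CPU side of the batch might look like the sketch below. `Vec2`, `Line` and `packLineVertices` are hypothetical names, the vertex layout (two floats per point) is an assumption, and the OpenGL calls appear only in comments because they need a live context and a buffer handle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };
struct Line { Vec2 a, b; };   // already in final (transformed) positions

// Flatten all lines into one contiguous array: x0, y0, x1, y1 per line.
std::vector<float> packLineVertices(const std::vector<Line>& lines) {
    std::vector<float> out;
    out.reserve(lines.size() * 4);
    for (const Line& l : lines) {
        out.push_back(l.a.x);
        out.push_back(l.a.y);
        out.push_back(l.b.x);
        out.push_back(l.b.y);
    }
    assert(out.size() == lines.size() * 4);
    return out;
}

// Per frame, with a GL context and a buffer object `vbo` (not shown here):
//   std::vector<float> verts = packLineVertices(lines);
//   glBindBuffer(GL_ARRAY_BUFFER, vbo);
//   glBufferData(GL_ARRAY_BUFFER, verts.size() * sizeof(float),
//                verts.data(), GL_STREAM_DRAW);   // one upload
//   glDrawArrays(GL_LINES, 0, static_cast<GLsizei>(verts.size() / 2));
```

`GL_STREAM_DRAW` is used here as a hint that the data is rewritten every frame. Whether it helps depends on the driver, so it is worth profiling against the alternatives.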
ssube
  • 3
    Ok, so does that mean it is always better to bake into a VBO and then draw, as opposed to using matrices to transform? What if I have, say, a handful, maybe 10, moving textured quads? Would it really be better to calculate the coordinates of the objects, recreate the VBO, upload and draw, as opposed to using a matrix translation (which I optimize to be simply 2 additions rather than 64 multiplications and 48 additions) on each object and then drawing each one? – mk12 Sep 24 '11 at 05:21
  • 1
    "If we assume points of 32 bytes each (float4 position and color)" And it would take virtually no effort to cut that in half: a vec3 of position and a vec4 of unsigned byte colors. Also, you should investigate [buffer object streaming](http://www.opengl.org/wiki/Buffer_Object_Streaming) to improve the performance of this. – Nicol Bolas Sep 24 '11 at 05:21
  • 1
    @Mk12 You have asked a much more complicated question. Your question's answer was simple because each object was very tiny and even when you had a lot of them, the vertex data was small. Once you have larger numbers of objects, the cost of calculating their positions becomes much more significant, as does the upload costs. – Nicol Bolas Sep 24 '11 at 05:23
  • 1
    @Nicol: 10 textured quads (xyst) is much tinier than 100 000 lines (xy), is it not? I'm just trying to make sure whether peachykeen's answer can be applied to, say, a relatively simple 2D game. – mk12 Sep 24 '11 at 05:27
  • 1
    @Mk12: If you're making a "relatively simple 2D game," performance really should be the least of your concerns. You could be using immediate mode. Spend more time making your game and less time dealing with this kind of minutiae. – Nicol Bolas Sep 24 '11 at 05:43
  • 1
    @Nicol: I'm not using immediate mode because it's deprecated. And you're probably right, but I'm a bit of a perfectionist, so even if there is no difference in performance, I just want to do it the *right* way. – mk12 Sep 24 '11 at 06:00
  • 4
    It's just math. In the case of 10 quads, the costs shift; the geometry is smaller and you're forced to have more function call overhead. In your example case, the large number of points makes it a very clear decision. Under different circumstances, you'll need to profile and see where the bottlenecks are and optimize accordingly. – ssube Sep 24 '11 at 06:14
  • 3
    One thing I'd like to point out is that OpenGL matrix operations (which are deprecated in later versions) don't happen on the GPU. They're executed by the driver. It is this driver-side execution that may trigger cascades of dependent state changes. Worse, you would do the state changes between draws, so you'd interrupt the GPU in its workflow; it takes superscalar processors some working cycles of a task to reach full efficiency. So a lot of individual movements are best carried out in a large buffer. One single, even movement of the whole set, however, is better done by matrix. – datenwolf Sep 24 '11 at 09:18
  • 1
    @datenwolf: I'm not using the deprecated OpenGL matrix functions, I implemented my own. I have my translate function optimized to simply add x and y rather than multiplying an addition matrix. However, some of my game objects rotate, so I *need* a matrix for that, regardless of whether I'm putting the values into a VBO each frame or sending it in the MVP matrix to my shader (and making multiple draw calls). – mk12 Sep 24 '11 at 14:36
  • 1
    @NicolBolas: I see what you mean. I tried both in my game and profiled them, and I could barely find a difference at all. Maybe I should try to be less obsessive about efficiency ;). – mk12 Sep 30 '11 at 21:44
  • "100000 line segments is less than 6 MB and PCIe 2.0 x16 is about 8 GB/s, I believe." 8 GB/s is really not much if the geometry changes every frame, though. With the 240 Hz monitors that are common nowadays, that means just 34 megabytes of bandwidth per frame if one aims to maximize framerate (or 68 MB for PCIe 3, 137 MB for PCIe 4, etc., and that's in the very best case). – Jean-Michaël Celerier Aug 12 '22 at 12:58