
Currently I have the following setup, which is working fine so far.

struct Vertex {
    glm::vec3 position;
    glm::vec3 normal;
    glm::vec2 texCoord;
};
std::vector<Vertex> vertices;

The vertex attributes:

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*) offsetof(Vertex, position));
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*) offsetof(Vertex, normal));
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*) offsetof(Vertex, texCoord));

Now I want to increase my performance by changing the vertex attributes from float to short. I started with the vertex positions.

OpenGL's Vertex Specification Best Practices tells me this:

Positions [...] To do this, you rearrange your model space data so that all positions are packed in a [-1, 1] box around the origin. You do that by finding the min/max values in XYZ among all positions. Then you subtract the center point of the min/max box from all vertex positions; followed by scaling all of the positions by half the width/height/depth of the min/max box. You need to keep the center point and scaling factors around. When you build your model-to-view matrix (or model-to-whatever matrix), you need to apply the center point offset and scale at the top of the transform stack (so at the end, right before you draw).

I also read this thread.

That's why I added this preprocessing step, mapping all vertex positions to [-1, 1]:

for (auto& v : vertices) {
    // map each position into [-1, 1]: offset by the box center, divide by the half extents
    v.position = (v.position - center) / halfAxisLengths;
}
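
where center and halfAxisLengths are the center point and half extents of the min/max box, e.g. computed like this (a minimal sketch using glm's component-wise min/max; variable names other than center and halfAxisLengths are only for illustration):

// Find the axis-aligned min/max box of all positions,
// then derive its center point and half extents.
glm::vec3 minPos = vertices.front().position;
glm::vec3 maxPos = vertices.front().position;
for (const auto& v : vertices) {
    minPos = glm::min(minPos, v.position);
    maxPos = glm::max(maxPos, v.position);
}
glm::vec3 center          = (minPos + maxPos) * 0.5f;
glm::vec3 halfAxisLengths = (maxPos - minPos) * 0.5f;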

and rescale it in the vertex shader:

vec4 rescaledPos = vec4(in_pos, 1.0) * vec4(halfAxisLengths, 1.0) + vec4(center, 0.0);
gl_Position = P * V * M * rescaledPos;
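
The same offset and scale could instead be folded into the model matrix on the CPU side, so the shader needs no extra per-vertex math (a sketch using glm, where modelMatrix stands for the original model transform and center/halfAxisLengths come from the preprocessing above):

#include <glm/gtc/matrix_transform.hpp> // glm::translate, glm::scale

// Append the de-normalization (scale by half extents, then offset by center)
// to the model matrix; the shader can then use in_pos directly:
glm::mat4 M = modelMatrix
            * glm::translate(glm::mat4(1.0f), center)
            * glm::scale(glm::mat4(1.0f), halfAxisLengths);
// gl_Position = P * V * M * vec4(in_pos, 1.0);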

My vertex attribute now uses GL_SHORT instead of GL_FLOAT, with normalized set to GL_TRUE:

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_SHORT, GL_TRUE, sizeof(Vertex), (void*) offsetof(Vertex, position));

As a result I just get a chaos of triangles, not my model with increased fps.

Is this the correct way to set vertex attributes to short?

Or do I have to change my complete Vertex structure? If so, what's the best way to do this (glm vectors with shorts)?

A working example would be great; I couldn't find any.

  • This is an aside to the question, but have you considered half floats? – William Kappler Apr 21 '16 at 15:57
  • @WilliamKappler Yes, `GL_HALF_FLOAT` is working fine. But I want to use `GL_SHORT` / `GL_UNSIGNED_SHORT` (especially for texCoords) and `GL_INT_2_10_10_10_REV` for normals. – Mr. X Apr 21 '16 at 16:44
  • "*rescale it in the vertex shader*" Why don't you just put that in the matrix itself? It's just a scale/translation. Matrices can do that. – Nicol Bolas Apr 21 '16 at 17:02
  • I will do that. Right now I'm in a "just testing" state. – Mr. X Apr 21 '16 at 20:59

2 Answers


I adjusted the data structure for the vertex buffer:

struct newVertex {
    GLshort position[4]; // for GL_SHORT
    GLint normal; // for GL_INT_2_10_10_10_REV
    GLshort texCoord[2]; // for GL_SHORT
};
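
A matching attribute setup for this layout might look like the following (a sketch; attribute locations 0/1/2 taken from the question). Because glVertexAttribPointer is used with normalized set to GL_TRUE, the GLSL inputs stay ordinary float vec3/vec2; only the storage format changes:

glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 4, GL_SHORT, GL_TRUE, sizeof(newVertex), (void*) offsetof(newVertex, position));
glEnableVertexAttribArray(1);
glVertexAttribPointer(1, 4, GL_INT_2_10_10_10_REV, GL_TRUE, sizeof(newVertex), (void*) offsetof(newVertex, normal));
glEnableVertexAttribArray(2);
glVertexAttribPointer(2, 2, GL_SHORT, GL_TRUE, sizeof(newVertex), (void*) offsetof(newVertex, texCoord));
// Note: GL_INT_2_10_10_10_REV requires a component count of 4 (or GL_BGRA).
// For texCoords in [0, 1], GLushort with GL_UNSIGNED_SHORT is the usual alternative.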

As a result I get ~20% increased performance.
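
Converting the original float attributes into this packed layout could be done with small helpers like these (hypothetical names; the scale factors follow the standard signed-normalized conventions, and positions are assumed to already be mapped to [-1, 1] as described in the question):

#include <cmath>       // std::lround
#include <glm/glm.hpp>

// [-1, 1] float -> signed normalized 16-bit short
GLshort toSnorm16(float f) {
    return (GLshort) std::lround(f * 32767.0f);
}

// [-1, 1] float -> 10-bit two's complement component
GLuint toSnorm10(float f) {
    return ((GLuint) std::lround(f * 511.0f)) & 0x3FF;
}

// GL_INT_2_10_10_10_REV layout: x in bits 0-9, y in 10-19, z in 20-29, w in 30-31
GLint packNormal(const glm::vec3& n) {
    return (GLint) (toSnorm10(n.x) | (toSnorm10(n.y) << 10) | (toSnorm10(n.z) << 20));
}

newVertex pack(const Vertex& v) {
    newVertex out{};
    out.position[0] = toSnorm16(v.position.x);
    out.position[1] = toSnorm16(v.position.y);
    out.position[2] = toSnorm16(v.position.z);
    out.position[3] = 0;                      // padding: the 4th component keeps the data 32-bit aligned
    out.normal      = packNormal(v.normal);
    out.texCoord[0] = toSnorm16(v.texCoord.x);
    out.texCoord[1] = toSnorm16(v.texCoord.y);
    return out;
}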


Or do I have to change my complete Vertex structure?

Yes, OpenGL will not magically do the conversion for you. But then if performance is your goal…

Now I want to increase my performance by changing the vertex attributes from float to short.

This would actually hurt performance. GPUs are optimized for processing vectors as floating point values. This in turn influences the memory interface, which is designed to give the best performance for 32-bit aligned accesses. By submitting 16-bit short integers you're forcing the current line of GPUs to perform suboptimal memory accesses and an intermediary conversion step.

If performance is your goal, stick to single-precision float. If you don't believe me: Benchmark it.

  • I could use a 4th vector component as padding to fill out 32-bit alignment? People [here](http://stackoverflow.com/questions/5718846/how-can-i-optimize-the-rendering-of-a-large-model-in-opengl-es-1-1/5721102) and [here](http://stackoverflow.com/questions/1287811/what-does-the-tiler-utilization-statistic-mean-in-the-iphone-opengl-es-instrumen/1316494#1316494) get more performance by using shorts. I think less memory to transfer would be faster, wouldn't it? – Mr. X Apr 21 '16 at 16:55
  • "*If you don't believe me: Benchmark it.*" You're the one who's making a claim of dubious nature, which flies in the face of conventional wisdom. So I would say that the burden of proof is on you. But you are right that you'd have to pad out the position to be a 4-element position. But it still saves you 32-bits per vertex. You'd save even more by compacting the normal in a 10/10/10/2 format, and the texture coordinates to 16-bit normalized shorts. That goes from 32-bytes-per-vertex to 16. Half the data size. I'm pretty sure any conversion step will be faster than the memory access. – Nicol Bolas Apr 21 '16 at 17:01
  • @Mr.X: OpenGL-ES != OpenGL. On mobile GPUs, like the one found in the iPhone, saving memory bandwidth will in fact improve performance. – datenwolf Apr 21 '16 at 17:16
  • @NicolBolas: I thought so, too, for a long time. Then I attended a Khronos event two weeks ago, where GPU engineers from NVidia and AMD were present and patiently explained to me in detail, why these kinds of conversions throw a wrench in modern GPUs' gears. It even goes so far (at least as far as AMD is concerned) that the drivers may decide to actually recompile shaders for a different format and do an in-situ format conversion if they see that fit. – datenwolf Apr 21 '16 at 17:25
  • "*the drivers may decide to actually recompile shaders for a different format and do an in-situ format conversion*" AMD's drivers always do that, since their GCN hardware doesn't have vertex fetch logic at all. That doesn't change the fact that reducing memory bandwidth and pre-T&L cache hit costs are significant. If you have a link to this event, or some similar information, please share it. – Nicol Bolas Apr 21 '16 at 17:49
  • @NicolBolas: So I sat down yesterday evening to get some hard figures and wrote a benchmark. Pretty straightforward: Allocate a VBO, then in a loop update it with a compute shader (using a compute shader to eliminate transfer bottlenecks and possible in-situ conversion from the backing store, and using continuous updates to prevent the driver from making an "optimized" shadow copy) and draw it with glDrawArrays (to prevent index caching) as points inside a TIME_ELAPSED query block. – The results are in and they say that drawing time is *inversely* proportional to the viewport size covered… WTF?! – datenwolf Apr 24 '16 at 11:51
  • @NicolBolas: This just goes to show how hard it can be to benchmark these things properly. And I didn't even measure the difference between data types yet. This was just a control to establish a baseline and test the methodology. Here's the conventional wisdom again: The more fragments you touch, the worse your fillrate and hence the longer it takes to draw. But in a small viewport with lots of overdraw, things are slowed down in ways I don't yet fully understand (I've got some ideas what's going on, but engineers from TEAM GREEN would have to chime in – tested it on a GTX 980). – datenwolf Apr 24 '16 at 11:55
  • @NicolBolas: Anyway, the event I was attending was the Khronos Munich Chapter, which was mostly focused on Vulkan. After the talks, at socializing time I got to talk with engineers from Team Red, Team Green and Team Turquoise (embedded, mobile) about how to efficiently deal with packed formats; I in particular had at that time to deal with 2× 12bits packed into 3 byte data (coming from a very fast ADC) and how to get the best throughput. – datenwolf Apr 24 '16 at 11:59
  • @NicolBolas: Benchmark code is here, if you want to give it a spin (more data always appreciated): https://github.com/datenwolf/pointoverdrawbench – datenwolf Apr 24 '16 at 22:46
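
(For reference, the TIME_ELAPSED measurement described in the comments above boils down to a timer query around the draw call; a minimal sketch, not code from the linked benchmark, where vertexCount stands for the number of points in the VBO:)

GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
glDrawArrays(GL_POINTS, 0, vertexCount);
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsedNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs); // blocks until the result is available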