Cache-friendly vertex definition?

Question

I am writing an OpenGL application, and for vertex positions, normals, and colors I am using separate buffers as follows:

GLuint vertex_buffer, normal_buffer, color_buffer;

My supervisor tells me that if I define a struct like:

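#include <glm/glm.hpp>

// One interleaved record per vertex: position, normal, and color side by side.
struct vertex {
    glm::vec3 pos;
    glm::vec3 normal;
    glm::vec3 color;
};
GLuint vertex_buffer; // a single VBO holding an array of `vertex`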

and then define a buffer of these vertices, my application will get much faster, because when a vertex's position is cached, its normal and color will be in the same cache line.

I think defining such a struct will not have much effect on performance. With the struct, fewer vertices fit in each cache line, while with separate buffers there will be three different cache lines in the cache, one each for positions, normals, and colors. So nothing really changes. Is that true?

score 4 · Answer 1 · answered Apr 14 '15 at 09:45


Depends on the GPU architecture.

Most GPUs have multiple caches (some for uniforms, others for vertex attributes, others for texture sampling).

Also, when the vertex shader is nearly done, the GPU can prefetch the next set of attributes into the cache, so that by the time the vertex shader finishes, the next attributes are right there, ready to be loaded into registers.

tl;dr: don't bother with these "rules of thumb" unless you actually profile or know the actual architecture of the GPU.

answered Apr 14 '15 at 09:45

ratchet freak


So you mean the GPU can read different cache lines at the same time? For example, the vertex position, normal, and derivs from the different buffers at the same time? – mmostajab Apr 14 '15 at 09:48

@mmostajab reading the attributes is very predictable, so prefetching will be used more often and be much more effective. – ratchet freak Apr 14 '15 at 09:53

@mmostajab: Yes, modern GPUs can access their cache lines concurrently. – datenwolf Apr 14 '15 at 09:58

score 4 · Accepted Answer · edited May 23 '17 at 12:21

First of all, using separate buffers for different vertex attributes may not be a good technique.

A very important factor here is the GPU architecture. Most GPUs (especially modern ones) have multiple caches (for Input Assembler stage data, uniforms, textures), but fetching input attributes from multiple VBOs can still be inefficient (always profile!). Defining them in an interleaved format can help improve performance:

[Image: interleaved vertex data layout]

And that's what you would get if you used such a struct.

However, that's not always true (again, always profile!): although interleaved data is more GPU-friendly, it needs to be properly aligned and can take significantly more space in memory.

But, in general:

Interleaved data formats:

Cause less GPU cache pressure, because the coordinates and attributes of a single vertex aren't scattered all over memory. They fit consecutively into a few cache lines, whereas scattered attributes could cause more cache updates and therefore more evictions. The worst-case scenario could be one (attribute) element per cache line at a time because of distant memory locations, while vertices are pulled in a non-deterministic, non-contiguous order where prediction and prefetching may not kick in at all. GPUs are very similar to CPUs in this respect.

Are also very useful for various external formats that match the deprecated interleaved formats, where datasets from compatible data sources can be read straight into mapped GPU memory. I ended up re-implementing these interleaved formats with the current API for exactly those reasons.

Should be laid out in an alignment-friendly way, just like simple arrays. Mixing data types with different size/alignment requirements may need padding to be GPU- and CPU-friendly. This is the only downside I know of, apart from the more difficult implementation.

Do not prevent you from pointing to single attribute arrays within them for sharing.

Source

Further reads:

Best Practices for Working with Vertex Data

Vertex Specification Best Practices

From the OpenGL wiki (your last link): "How much interleaving attributes helps in rendering performance is not well understood. Profiling data are needed. Interleaved vertex data may take up more room than un-interleaved due to alignment needs." — ratchet freak, Apr 14 '15 at 10:08
My point is that you are advocating interleaving when the wiki doesn't say anything (pro or contra) about it. — ratchet freak, Apr 14 '15 at 10:28
`[...] and then store each of these interleaved vertex blocks sequentially, again combining all the vertex attributes into a single buffer. [...] The optimal layout depends on the specific GPU and driver (plus OpenGL implementation).` — Mateusz Grzejek, Apr 14 '15 at 10:40
Another advantage of using interleaved attributes in a single buffer vs. separate buffers for each attribute is that there are fewer buffers that need to be bound. So it can reduce CPU overhead for setting up the state. — Reto Koradi, Apr 15 '15 at 02:22
@RetoKoradi yeah, you're saving a lot with something that needs to be bound only at the beginning... You save more CPU time by simply not unbinding/rebinding the same buffers every single frame just because the tUtORiAL you looked at didn't explain it and told you to copy-paste instead. — , Dec 23 '19 at 20:21

score 2 · Answer 3 · answered Apr 14 '15 at 09:48

Tell your supervisor "premature optimization is the root of all evil" – Donald E. Knuth. But don't forget the sentence that follows it, which says, in effect, "but that doesn't mean we shouldn't optimize hot spots".

So did you actually profile the differences?

Anyway, the layout of your vertex data is not critical for caching efficiency on modern GPUs. It used to be on old GPUs (ca. 2000), which is why there were functions for interleaving vertex data. But these days it's pretty much a non-issue.

That has to do with the way modern GPUs access memory: in fact, modern GPUs' cache lines are not indexed by memory address but by access pattern (i.e., the first distinct memory access in a shader gets the first cache line, the second one gets the second cache line, and so on).

edited Apr 14 '15 at 09:56

answered Apr 14 '15 at 09:48

datenwolf


@datenwolf I have not profiled it, but I get almost the same fps in both cases. – mmostajab Apr 14 '15 at 09:53

@mmostajab: That would be a profile and the result you got is what's to be expected. In fact on modern GPUs you may actually get a (very small) performance increase by **not** interleaving the data. – datenwolf Apr 14 '15 at 09:57

score 0 · Answer 4 · answered Mar 21 '23 at 17:52


It sounds like a good idea to put positions into a separate VBO, so that a z-prepass or shadow pass can render them without fetching attributes that don't affect it, such as UVs, colors, or normals.

answered Mar 21 '23 at 17:52

holydel


