This question seems to have been answered many times over the years for one IHV or another, but recently I have been trying to reach a consensus on vertex layouts and best practices for a modern renderer across all IHVs and architectures. Before someone says "benchmark": I can't easily do that, as I don't have access to a card from every IHV and every architecture of the last 5 years. Therefore, I am looking for best practices that will work decently well across all platforms.
First, the obvious:
- Separating position from other attributes is good for:
  - Shadow and depth pre-passes
  - Per-triangle culling
  - Tile-based deferred renderers (such as Apple's M1)
- Interleaved is more logical on the CPU; you can have a `Vertex` class (see the sketch after this list).
- Non-interleaved can make some CPU-side calculations faster, since they can take advantage of SIMD.
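To make the two layouts concrete, here is a minimal C++ sketch of both; the struct and field names are my own illustration, not from any vendor guidance:

```cpp
#include <vector>

// Interleaved ("array of structures"): one stream, one struct per vertex.
struct Vertex {
    float position[3];
    float normal[3];
    float uv[2];
};
std::vector<Vertex> interleaved; // a single vertex buffer

// De-interleaved ("structure of arrays"): one tightly packed stream per
// attribute. Position lives alone, which is what depth-only passes want.
struct VertexStreams {
    std::vector<float> positions; // 3 floats per vertex
    std::vector<float> normals;   // 3 floats per vertex
    std::vector<float> uvs;       // 2 floats per vertex
};
```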
Now onto the less obvious.
Many people quote NVIDIA as saying that you should always interleave, and moreover that you should align each vertex to 32 or 64 bytes. I have not found the source of this claim; what I have found instead is an NVIDIA document about vertex shader performance, but it is quite old (2013) and concerns the Tegra GPU, which is mobile, not desktop. In particular, it says:
> Store vertex data as interleaved attribute streams ("array of structures" layout), such that "over-fetch" for an attribute tends to pre-fetch data that is likely to be useful for subsequent attributes and vertices. Storing attributes as distinct, non-interleaved ("structure of arrays") streams can lead to "page-thrashing" in the memory system, with a massive resultant drop in performance.
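To illustrate what that interleaved, size-aligned recommendation amounts to, here is one possible 32-byte packing; the specific attributes and encodings are my own example, not NVIDIA's:

```cpp
#include <cstdint>

// One interleaved vertex padded to exactly 32 bytes, so two vertices fit in
// a 64-byte cache line and strided fetches stay aligned.
struct PackedVertex {
    float    position[3]; // 12 bytes
    uint32_t normal;      //  4 bytes, e.g. 10:10:10:2 packed
    uint32_t tangent;     //  4 bytes, same packing
    uint16_t uv[2];       //  4 bytes, half floats or normalized shorts
    uint32_t color;       //  4 bytes, RGBA8
    uint32_t pad;         //  4 bytes of padding up to 32
};
static_assert(sizeof(PackedVertex) == 32, "keep the stride cache-friendly");
```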
Fast-forward three years to GDC 2016, and EA gave a presentation that lists several reasons to de-interleave vertex buffers. However, this recommendation seems tied to AMD architectures, in particular GCN. While they make a cross-platform case for separating position, they propose de-interleaving everything, with the statement that it will allow the GPU to:
> Evict cache lines as quickly as possible
And that it is optimal for GCN (AMD) architectures.
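For concreteness, fully de-interleaving maps to one vertex buffer binding per attribute. Here is a Vulkan-flavored sketch; the binding numbers, strides, and formats are arbitrary choices of mine:

```cpp
#include <vulkan/vulkan.h>

// Three tightly packed streams: binding 0 holds positions only, so a depth
// pre-pass can bind it alone; bindings 1 and 2 carry the other attributes.
static const VkVertexInputBindingDescription bindings[] = {
    {0, sizeof(float) * 3, VK_VERTEX_INPUT_RATE_VERTEX}, // positions
    {1, sizeof(float) * 3, VK_VERTEX_INPUT_RATE_VERTEX}, // normals
    {2, sizeof(float) * 2, VK_VERTEX_INPUT_RATE_VERTEX}, // UVs
};

static const VkVertexInputAttributeDescription attributes[] = {
    {0, 0, VK_FORMAT_R32G32B32_SFLOAT, 0}, // location 0: position
    {1, 1, VK_FORMAT_R32G32B32_SFLOAT, 0}, // location 1: normal
    {2, 2, VK_FORMAT_R32G32_SFLOAT,    0}, // location 2: uv
};
```

A depth-only pipeline would then reference binding 0 alone, never touching the other streams.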
This seems to conflict with what I have heard elsewhere, which says to use interleaved layouts in order to make the most of each cache line. But again, that was not in regard to AMD.
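To spell out the tension with some illustrative (made-up) numbers:

```cpp
#include <cstddef>

// Assumed figures: a 64-byte cache line, a 32-byte interleaved vertex,
// and a 12-byte (3-float) position.
constexpr std::size_t cacheLine = 64;
constexpr std::size_t vertex    = 32;
constexpr std::size_t position  = 12;

// Depth-only pass over an interleaved buffer: each cache line holds two
// full vertices, but only 24 of its 64 bytes are positions (~38% useful).
constexpr std::size_t usefulInterleaved = (cacheLine / vertex) * position; // 24

// Same pass over a packed position stream: all 64 bytes are positions
// (~5 vertices per line), and the line can be evicted sooner.
constexpr std::size_t usefulPacked = cacheLine; // 64
```

When a pass consumes every attribute, the interleaved line is close to 100% useful as well, which is presumably where the conflicting advice comes from.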
With many different IHVs (Intel, NVIDIA, AMD, and now Apple with the M1 GPU), each with many different architectures, I am left completely uncertain about what one should do today (without the budget to test on dozens of GPUs) to best optimize performance across all architectures without incurring
> a massive resultant drop in performance
on some architectures. In particular, is de-interleaved still best on AMD? Is it no longer a problem on NVIDIA, or was it never a problem on desktop NVIDIA GPUs? What about the other IHVs?
NOTE: I am not interested in mobile, only in desktop GPUs from the past 5 years or so.