How much perf can I get using half_floats for vertex attribs?

Question

Currently I have a dynamic VBO (updated every frame, particle system). I send 4 floats for pos and 4 floats for color.

How much perf can I expect if I move to half_float data type? Is it something like 5% or maybe 30%?

Let's assume I want to send 100k...500k particles as point sprites. So I send around 100k * 8 floats * 4 bytes = ~3.2 MB per frame (about 16 MB at 500k).
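To put the question's numbers side by side, here is a small sketch of the per-frame upload arithmetic for the layouts discussed in this thread. The enum names are hypothetical labels, and the 60 fps figure in the usage note is an assumption, not something stated in the question.

```c
#include <stddef.h>

/* Per-vertex sizes in bytes for the layouts discussed in this thread
   (the enum names are illustrative labels only). */
enum {
    LAYOUT_FLOAT4_POS_FLOAT4_COLOR = 4 * 4 + 4 * 4, /* current: 32 bytes */
    LAYOUT_HALF4_POS_HALF4_COLOR   = 4 * 2 + 4 * 2, /* all half floats: 16 bytes */
    LAYOUT_FLOAT3_POS_RGBA8_COLOR  = 3 * 4 + 4 * 1  /* vec3 + 4 color bytes: 16 bytes */
};

/* Bytes uploaded to the VBO each frame. */
size_t bytes_per_frame(size_t particles, size_t bytes_per_vertex)
{
    return particles * bytes_per_vertex;
}

/* Upload rate in bytes per second at a given frame rate. */
double bytes_per_second(size_t particles, size_t bytes_per_vertex, double fps)
{
    return (double)bytes_per_frame(particles, bytes_per_vertex) * fps;
}
```

For example, 500k particles in the current layout is 16,000,000 bytes per frame; at an assumed 60 fps that is 960 MB/s, and either 16-byte layout halves it.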

Of course I am aware that there might be differences between various GPUs...

With this many vertices, the first thing I would do is use only three floats for position and omit the homogeneous coordinate (since it will be 1 in general). This cuts 4 of every 32 bytes per vertex, which adds up quickly at this vertex count. – BDL, Feb 27 '15 at 12:31
Yes, that's another good idea: it would save 1/8 of the bandwidth. But half floats could reduce this even more. – fen, Feb 27 '15 at 12:35
And how do I efficiently copy data from an array of vec4 into an array of vec3? Just a simple for loop? – fen, Feb 27 '15 at 12:58
Usually performance of the input assembler is related to the overall size and alignment of the vertex structure itself (32 or 64 bytes tend to be the most efficient). Using float16s can be useful to achieve improved vertex size/alignment. — Chuck Walbourn, Feb 27 '15 at 20:47
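For the "simple for loop" question above: yes, a plain loop that drops `w` is a reasonable starting point. The sketch below assumes hypothetical `vec4`/`vec3` structs standing in for whatever math library is in use; the same loop could also write straight into a mapped buffer instead of an intermediate array.

```c
#include <stddef.h>

/* Hypothetical plain-old-data vector types. */
typedef struct { float x, y, z, w; } vec4;
typedef struct { float x, y, z; } vec3;

/* Copy count positions, discarding the homogeneous w component.
   The loop does no arithmetic, so its cost is mostly memory traffic. */
void copy_vec4_to_vec3(vec3 *dst, const vec4 *src, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        dst[i].x = src[i].x;
        dst[i].y = src[i].y;
        dst[i].z = src[i].z;
    }
}
```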

score 2 · Answer 1 · edited May 23 '17 at 12:23

You pay more for converting from full float to half float every time the data changes than you save with the 2 bytes per value.

So check what saves you more in the long run: less memory bandwidth or less time spent filling the VBO. If the CPU is often idle (waiting for vsync), then trading CPU time for bandwidth will win out; however, if a lot of time is spent doing physics on the CPU, then you can't afford the conversion to half float.

Unless you have a very efficient conversion function for a type that usually isn't built in, and you can afford the CPU hit, I wouldn't bother.
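To give a feel for the per-value CPU work involved, here is a portable scalar sketch of float-to-half conversion (IEEE 754 binary32 to binary16, round to nearest even, handling infinity, NaN, overflow and subnormals). It is one common bit-manipulation approach, not a specific library's routine. On x86 CPUs with F16C, the `_mm_cvtps_ph` intrinsic converts four floats at once and may be much cheaper; the resulting values would be bound with `GL_HALF_FLOAT` as the attribute type.

```c
#include <stdint.h>
#include <string.h>

/* Convert a float to IEEE 754 half-precision bits, round to nearest even. */
uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);                 /* reinterpret bits safely */
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp == 0xFFu)                         /* Inf or NaN (keep NaN quiet) */
        return (uint16_t)(sign | 0x7C00u | (mant ? 0x200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15;      /* rebias exponent */
    if (e >= 0x1F)                            /* too large: becomes infinity */
        return (uint16_t)(sign | 0x7C00u);

    if (e <= 0) {                             /* half subnormal or zero */
        if (e < -10)
            return sign;                      /* below half of smallest subnormal */
        mant |= 0x800000u;                    /* restore implicit leading 1 */
        uint32_t shift = (uint32_t)(14 - e);
        uint32_t h = mant >> shift;
        uint32_t rem = mant & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (h & 1u)))
            h++;                              /* may carry into smallest normal */
        return (uint16_t)(sign | h);
    }

    uint32_t h = ((uint32_t)e << 10) | (mant >> 13);
    uint32_t rem = mant & 0x1FFFu;            /* 13 discarded mantissa bits */
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u)))
        h++;                                  /* carry may round up to infinity */
    return (uint16_t)(sign | h);
}
```

Every particle needs eight of these calls per frame, which is the CPU cost the answer is weighing against the bandwidth saved.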

Since you are looking for a more compact data format, I suggest using a linearly scaled (u)int16 instead (pass GL_TRUE as the `normalized` argument of glVertexAttribPointer to map {-32768, 32767} to {-1, 1}, or {0, 65535} to {0, 1}). If you don't need the extra precision near 0, this will give better accuracy.

score 1 · Answer 2 · answered Feb 28 '15 at 11:35


4 floats for color is very likely unnecessary to begin with.

You can probably get away with 4 unsigned bytes. Conversion from 8-bit fixed-point numbers to floating point is ridiculously cheap on GPUs (this is what they were originally designed to do), so you can cut the memory required for color by 75% with no inherent performance impact. Of course, as others have mentioned, alignment becomes an issue. Changing from a 16-byte color to a 4-byte one can easily mess up a vec4-aligned vertex data structure unless you pad it, add an extra attribute (say, a 3D vertex normal) or change another member of the data structure.
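A sketch of the CPU-side packing this implies, assuming color channels are floats nominally in [0, 1] (the `rgba8` type and function names are hypothetical). On the GL side the attribute would be declared with 4 components of `GL_UNSIGNED_BYTE` and `normalized` set to `GL_TRUE`, so the shader still receives a vec4 in [0, 1].

```c
#include <stdint.h>

/* One byte per channel, 4 bytes total. */
typedef struct { uint8_t r, g, b, a; } rgba8;

/* Clamp to [0, 1] and scale to [0, 255], rounding to nearest. */
static uint8_t float_to_unorm8(float f)
{
    if (!(f > 0.0f)) return 0;      /* also maps NaN to 0 */
    if (f >= 1.0f) return 255;
    return (uint8_t)(f * 255.0f + 0.5f);
}

rgba8 pack_color(float r, float g, float b, float a)
{
    rgba8 c = { float_to_unorm8(r), float_to_unorm8(g),
                float_to_unorm8(b), float_to_unorm8(a) };
    return c;
}
```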

A vec3 position plus a 4-byte color is 16 bytes, which fits nicely into cache boundaries. You should really try that first, unless you actually use a homogeneous coordinate other than 1.0 (seriously, do you?). This gives you the same memory requirements as switching everything to half floats, but you still get full-precision positions.
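The layout comparison can be checked at compile time. This is a sketch assuming C11 `static_assert` and hypothetical struct names; the commented-out `glVertexAttribPointer` calls show one plausible attribute setup, with attribute indices 0 and 1 chosen arbitrarily.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Layout from the question: vec4 position + vec4 color. */
typedef struct {
    float pos[4];
    float color[4];
} particle_vertex_fat;

/* Suggested layout: vec3 position + 4 normalized color bytes. */
typedef struct {
    float   pos[3];
    uint8_t color[4];
} particle_vertex_packed;

static_assert(sizeof(particle_vertex_fat) == 32, "expected 32 bytes");
static_assert(sizeof(particle_vertex_packed) == 16, "expected 16 bytes");
static_assert(offsetof(particle_vertex_packed, color) == 12, "color follows pos");

/* Possible attribute setup (indices 0 and 1 are assumptions):
   glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(particle_vertex_packed),
                         (void *)offsetof(particle_vertex_packed, pos));
   glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, sizeof(particle_vertex_packed),
                         (void *)offsetof(particle_vertex_packed, color)); */
```

The packed vertex is exactly half the original, and 16 bytes divides evenly into the 32- and 64-byte sizes mentioned in the comments on the question.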


Andon M. Coleman


How do I convert from vec4 to rgba8 efficiently? Or is it better to perform the computation directly on rgba8? My initial tests show not much performance improvement... maybe I lose time doing the conversions inefficiently (vec4 array into vec3, and then vec4 into rgba8). – fen Mar 01 '15 at 18:52
I really wouldn't expect much in the way of performance improvements, to be honest. You're doing the transformation on the CPU and then transferring it to the GPU each frame, which is going to be more of a performance bottleneck than memory bandwidth. Ultimately, moving the particle simulation onto the GPU itself is ideal; this is much easier to accomplish these days thanks to transform feedback / compute shaders. – Andon M. Coleman Mar 01 '15 at 20:34
Yes, that's a good option, but I just want to explore a CPU particle system first. In the next "release" I'll probably move to the GPU side. – fen Mar 02 '15 at 19:28
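On the "doing conversions inefficiently" point: one option is to convert position and color in a single pass, writing each packed vertex once instead of building separate vec3 and rgba8 arrays. This is a sketch with hypothetical type and function names; it assumes `dst` could be a pointer returned by `glMapBufferRange` with `GL_MAP_WRITE_BIT`, in which case it is better to only write to it, never read from it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulation-side type updated by the CPU particle system. */
typedef struct { float x, y, z, w; } vec4;

/* GPU-side vertex: vec3 position + RGBA8 color, 16 bytes. */
typedef struct {
    float   pos[3];
    uint8_t color[4];
} packed_vertex;

/* Clamp to [0, 1] and scale to [0, 255], rounding to nearest. */
static uint8_t to_unorm8(float f)
{
    if (!(f > 0.0f)) return 0;
    if (f >= 1.0f) return 255;
    return (uint8_t)(f * 255.0f + 0.5f);
}

/* One pass over the particles: build each vertex locally, then copy it out.
   dst is only ever written, which matters if it is write-combined memory. */
void fill_vertices(packed_vertex *dst, const vec4 *pos, const vec4 *col,
                   size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        packed_vertex v;
        v.pos[0] = pos[i].x;
        v.pos[1] = pos[i].y;
        v.pos[2] = pos[i].z;            /* w is dropped */
        v.color[0] = to_unorm8(col[i].x);
        v.color[1] = to_unorm8(col[i].y);
        v.color[2] = to_unorm8(col[i].z);
        v.color[3] = to_unorm8(col[i].w);
        dst[i] = v;
    }
}
```

As the answer notes, this mainly trims CPU overhead; the per-frame upload itself remains.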
