284

I am reading an awesome OpenGL tutorial. It's really great, trust me. The topic I am currently on is the Z-buffer. Aside from explaining what it's all about, the author mentions that we can perform custom depth tests, such as GL_LESS, GL_ALWAYS, etc. He also explains that the actual meaning of depth values (which counts as on top and which doesn't) can also be customized. I understand so far. And then the author says something unbelievable:

The range zNear can be greater than the range zFar; if it is, then the window-space values will be reversed, in terms of what constitutes closest or farthest from the viewer.

Earlier, it was said that the window-space Z value of 0 is closest and 1 is farthest. However, if our clip-space Z values were negated, the depth of 1 would be closest to the view and the depth of 0 would be farthest. Yet, if we flip the direction of the depth test (GL_LESS to GL_GREATER, etc), we get the exact same result. So it's really just a convention. Indeed, flipping the sign of Z and the depth test was once a vital performance optimization for many games.

If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.
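To make sure I'm reading it right, here is roughly what I take "flipping the sign of Z and the depth test" to mean in terms of GL calls (my own sketch, not code from the tutorial):

```c
/* Sketch only: the two setups below produce the same visible result, they
 * just store depth in the buffer with opposite conventions. */
#include <GL/gl.h>

void conventional_setup(void)
{
    glDepthRange(0.0, 1.0);   /* window-space Z: 0 = nearest, 1 = farthest */
    glDepthFunc(GL_LESS);     /* smaller depth value wins */
    glClearDepth(1.0);        /* clear to "farthest" */
}

void reversed_setup(void)
{
    glDepthRange(1.0, 0.0);   /* window-space Z: 1 = nearest, 0 = farthest */
    glDepthFunc(GL_GREATER);  /* larger depth value wins */
    glClearDepth(0.0);        /* clear to "farthest" under this convention */
}
```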

Is the author making things up, am I misunderstanding something, or is it indeed the case that once < was slower (vitally, as the author says) than >?

Thanks for clarifying this quite curious matter!

Disclaimer: I am fully aware that algorithm complexity is the primary source for optimizations. Furthermore, I suspect that nowadays it definitely wouldn't make any difference and I am not asking this to optimize anything. I am just extremely, painfully, maybe prohibitively curious.

bobobobo
  • 64,917
  • 62
  • 258
  • 363
Armen Tsirunyan
  • 130,161
  • 59
  • 324
  • 434
  • (a < b) is identical to (b > a), so there is absolutely no need to implement both compare operations in hardware. The difference in performance is a result of what happens after the compare operation. It's a long and winding road to explain all of the side effects, but here are a few pointers: games used to fill the depth buffer first to avoid more expensive fragment processing for fragments that failed the depth test; Quake used to split the depth range into two halves to avoid clearing the depth buffer (it already filled every pixel on screen each frame, so the color buffer never needed clearing either); and so on. – t0rakka Dec 27 '17 at 14:24
  • Here is the [archived version of the referenced OpenGL tutorial](https://web.archive.org/web/20150225192611/http://www.arcsynthesis.org/gltut/index.html). I can't seem to find the quoted snippet though, maybe it got taken out. – Fons Jul 08 '18 at 13:03

3 Answers

353

If I understand correctly, performance-wise, flipping the sign of Z and the depth test is nothing but changing a < comparison to a > comparison. So, if I understand correctly and the author isn't lying or making things up, then changing < to > used to be a vital optimization for many games.

I didn't explain that particularly well, because it wasn't important. I just felt it was an interesting bit of trivia to add. I didn't intend to go over the algorithm specifically.

However, context is key. I never said that a < comparison was faster than a > comparison. Remember: we're talking about graphics hardware depth tests, not your CPU. Not operator<.

What I was referring to was a specific old optimization where one frame you would use GL_LESS with a range of [0, 0.5]. Next frame, you render with GL_GREATER with a range of [1.0, 0.5]. You go back and forth, literally "flipping the sign of Z and the depth test" every frame.

This loses one bit of depth precision, but you didn't have to clear the depth buffer, which once upon a time was a rather slow operation. Since depth clearing is not only free these days but actually faster than this technique, people don't do it anymore.
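In GL terms, the per-frame flip looked roughly like the sketch below (draw_scene is a placeholder for the application's own rendering, not a real call):

```c
#include <GL/gl.h>
#include <stdbool.h>

void draw_scene(void);   /* placeholder for the application's draw calls */

/* Alternating-half trick: the depth buffer is never cleared.  Each frame uses
 * one half of the depth range with a compare direction chosen so that the
 * leftover values from the previous frame can never occlude anything. */
void render_frame(bool even_frame)
{
    if (even_frame) {
        glDepthRange(0.0, 0.5);   /* write depths into the lower half */
        glDepthFunc(GL_LESS);     /* last frame's values (>= 0.5) always lose */
    } else {
        glDepthRange(1.0, 0.5);   /* write depths into the upper half, reversed */
        glDepthFunc(GL_GREATER);  /* last frame's values (<= 0.5) always lose */
    }
    draw_scene();
}
```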

Nicol Bolas
  • 449,505
  • 63
  • 781
  • 982
  • 2
    Clearing the depth buffer is faster these days for two reasons, both based around the fact that the GPU uses a hierarchical depth buffer. A clear then only has to set the tile states to "cleared" (which is fast); changing the depth compare sign, however, means that the entire HiZ buffer needs to be flushed, because it only stores a min or max value depending on the compare sign. – Jasper Bekkers Jun 29 '12 at 19:41
  • 3
    @NicolBolas: Per TZHX's comment, the link to your tutorial in my question has gone dead. Could you please let us all know where the tutorials have moved, and optionally edit the question? – Armen Tsirunyan Mar 24 '15 at 12:57
  • 2
    The tutorials are available on the web archive. If @NicolBolas permits, it would be helpful for the community if we could move them to a more accessible location. Maybe GitHub or something. http://web.archive.org/web/20150215073105/http://arcsynthesis.org/gltut/ – ApoorvaJ Jun 27 '15 at 13:21
3

The answer is almost certainly that, for whatever incarnation of chip + driver was used, the hierarchical Z only worked in one direction - this was a fairly common issue back in the day. Low-level assembly/branching has nothing to do with it - Z-buffering is done in fixed-function hardware, and is pipelined - there is no speculation and hence no branch prediction.
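As a conceptual sketch of why the direction matters (an illustration of the idea, not actual hardware or driver code): for a GL_LESS test, a hierarchical Z buffer only needs to keep a conservative per-tile maximum depth, and that single value becomes useless the moment the compare direction flips.

```c
/* Conceptual sketch of coarse Hi-Z rejection for a GL_LESS depth test. */
typedef struct {
    float max_depth;   /* farthest depth currently stored anywhere in the tile */
} HiZTile;

/* Returns 1 if every covered fragment in this tile is guaranteed to fail a
 * GL_LESS test, so the full-resolution depth buffer need not be read.
 * A GL_GREATER test would need a per-tile *minimum* instead, which is why
 * flipping the compare direction invalidates the hierarchical data. */
int reject_tile_less(const HiZTile *tile, float prim_min_depth)
{
    return prim_min_depth >= tile->max_depth;
}
```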

Crowley9
  • 544
  • 2
  • 6
-9

It has to do with flag bits in highly tuned assembly.

x86 has both jl and jg instructions, but most RISC processors only have jl and jz (no jg).
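As a minimal sketch of the operand-swap point raised in the comments below (plain C, nothing GPU-specific):

```c
/* A greater-than test can always be rewritten as a less-than test with the
 * operands swapped, so one compare direction is enough to express both. */
int passes_greater(float incoming, float stored)
{
    return stored < incoming;   /* identical result to (incoming > stored) */
}
```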

Armen Tsirunyan
  • 130,161
  • 59
  • 324
  • 434
Joshua
  • 40,822
  • 8
  • 72
  • 132
  • @Armen: that depends on how often the call is used, doesn't it? – Max Lybbert Sep 07 '11 at 19:12
  • 2
    If that's the answer, it raises new questions. Was "branch taken" slower than "branch ignored" on early RISC processors? It certainly isn't that way now in any measurable way as far as I know. Were you supposed to write `for` loops with an unconditional branch backwards and a conditional, seldom taken branch forward to exit the loop then? Sounds awkward. – Pascal Cuoq Sep 07 '11 at 19:43
  • I thought on modern architectures branch prediction depends on the direction of the branch [so conditional branches backwards are faster if taken, forwards are faster if not taken]. – Random832 Sep 07 '11 at 19:49
  • @Random: Do GPUs have branch predictors? – Oliver Charlesworth Sep 07 '11 at 19:54
  • @Oli Charlesworth Not any I'm aware of, and at least with Nvidia's Tesla architecture they're using predicates much more than branches. About branch prediction on early RISC, that's obviously extremely implementation-dependent, but early MIPS architectures always assumed the branch was **not** taken - the reason was probably that it simplifies the design for a pipelined CPU (you handle a branch like any other instruction and continue filling the pipe; if the branch is taken, flush it). – Voo Sep 07 '11 at 20:20
  • @Random832 On modern CPUs branch predictors are much more complicated than that. Intel used that solution (always assume a backwards branch is taken) in the Pentium days because it was still simple and obviously good for loops. Today we have local/global predictors that recognize patterns and whatnot - best to just assume the CPU knows what it's doing (notwithstanding mobile solutions; no idea what ARM is doing, but certainly something simpler that uses less power). – Voo Sep 07 '11 at 20:24
  • 56
    -1: This question has _nothing to do with CPUs_. GL_LESS and GL_GREATER are depth comparison operations, which run on GPUs. – Nicol Bolas Sep 07 '11 at 20:29
  • 1
    @Nicol: Did they use to run on GPU's in those days as well? – Armen Tsirunyan Sep 07 '11 at 20:35
  • 1
    A GPU is a RISC-VLIW-vector class of CPU. Same general principle applies. – Joshua Sep 07 '11 at 21:00
  • @Joshua Umm, no. "A" GPU can be basically anything. ATI used a VLIW architecture for a long time (and they're changing right now), that's all. For the Tesla architecture, e.g., the whole idea wouldn't make sense at all. – Voo Sep 07 '11 at 21:03
  • 9
    Funny how much rep you can get for an answer that is correct to the title but has very little to do with the actual question. – Joshua Sep 08 '11 at 01:12
  • 8
    +1 No, this answer is correct to at least part of the question. The question is: "Is the author making things up, am I misunderstanding something, or is it indeed the case that once < was slower (vitally, as the author says) than >?". There are three options given. This answer is responding to the possibility of option 3. Nowhere in the article is the technology of the CPU/GPU given, nor that it must be a GPU (the first 3D games were on the CPU). Ok... I don't think there were many 3D games on RISC :-) – xanatos Sep 14 '11 at 09:57
  • 4
    (and the GPU tag was added at 20:34. The first revision had only the CPU tag. This response was written at 18:44) – xanatos Sep 14 '11 at 10:12
  • > can be implemented with < by swapping the inputs. The compiler is more than capable of making this little sleight-of-hand when it knows the target architecture. Pretty likely the GPU is internally doing just that: selecting the inputs based on the compare state, but that is not really the interesting bit. The interesting performance bit is what happens after the compare operation: is the fragment killed or still alive? – t0rakka Dec 27 '17 at 14:37
  • @SnappleLVR: You overestimate the power of optimizers circa 1990. This question belongs on retrocomputing, which didn't exist when it was asked. – Joshua Dec 27 '17 at 16:35
  • I don't think so. The mindset I had when responding was that the post mentioned that some RISC processors only had a less-than compare; that transformation is obviously the choice for greater-than when there is only less-than available in the ISA. I wasn't thinking in terms of optimization here at all. The same principle applies to GPU design, especially mobile graphics cores, because real estate is at a premium; a smaller chip generally requires less power, and if you can replace a full compare operation with simple input selection you should probably do it. Area/power optimization. – t0rakka Dec 28 '17 at 08:15