7

Is a pointer indirection (to fetch a value) more costly than a conditional?

I've observed that most decent compilers can precompute a pointer indirection to varying degrees (possibly removing most branching instructions), but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.

I would expect that if the data referenced by the pointer is not in a cache at runtime, a cache miss (and a fetch from main memory) might occur, but I don't have any data to back that up.

Does anyone have solid data (or a justifiable opinion) on the matter?


EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.

If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.

Philip Conrad
  • To answer your edit--on PowerPC, a correctly predicted branch can be zero cycles. It is actually lower cost than a load. – StilesCrisis Jan 10 '13 at 00:53

4 Answers

5

This is very much dependent on the circumstances.

1 How often is the data in cache (L1, L2, L3), and how often must it be fetched all the way from RAM?

A fetch from RAM will take around 10-40 ns. Of course, that fetch fills a whole cache line in little more than that, so if you then use the next few bytes as well, they will definitely not "hurt as bad".

2 What processor is it?

The older Intel Pentium 4 was famous for its long pipeline, and would take 25-30 clock cycles (~15 ns at 2 GHz) to "recover" from a mispredicted branch.

3 How "predictable" is the condition?

Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but mispredictions do hurt a little bit.

4 How "busy" and "dirty" is the cache?

If you have to evict some dirty data to make room for the cache line, it will take another 15-50 ns on top of the "fetch the data in" time.

The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.

On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.

And finally, of course, the compiler USUALLY knows quite well what works best... ;)

In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would think that an indirect memory access is faster than a jump, but without seeing what your source compiles to, it's quite hard to say.

Mats Petersson
  • Did you have a particular target CPU in mind for your stats (like the Pentium 4 you mention), or were you making educated guesses? Your figures seem reasonable, but if you can be more specific, it might make your answer more useful. – Philip Conrad Jan 10 '13 at 01:15
  • The numbers aren't that far off on any of the modern ones. They all have FAIRLY long pipelines (20 stages+), and it takes about the same time to read from memory. – Mats Petersson Jan 10 '13 at 08:39
  • Note also that modern compilers often avoid branches even when the source has them. I ran some tests many years ago on AMD and Intel (P4) hardware for branch efficiency/branch prediction: I created a fairly long string (32K if I remember right) with a mix of upper/lower case characters, and then counted how many upper and lower case there were. Then I sorted the string, and the count ran 4-7 times faster. I also wrote a "branchless" version, with some assembler code using conditional instructions that don't branch to update the count, and it was noticeably faster than the sorted variant. – Mats Petersson Jan 10 '13 at 08:43
  • I should add that I sent that little benchmark to both GCC and Microsoft developers (where I worked at the time had reasonable connections into those dev teams; I personally have connections with the GCC x86 development team), and I have tried this with much later compilers: they work out how to do this without branching in similar code. I "lost" my benchmark as such, but I have looked at the code generated for "if it's A do x; else do y", where x and y are simple operations. – Mats Petersson Jan 10 '13 at 09:01
3

It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.

My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.

StilesCrisis
  • Actually, when you say "pointer indirection," do you really mean "calling through a function pointer"? I was interpreting it as "reading a value from a pointer." – StilesCrisis Jan 10 '13 at 00:32
  • StilesCrisis's answer was more or less what I was expecting. Branching costs vary across platforms. I put the question out there to see if anyone had already done such a test. – Philip Conrad Jan 10 '13 at 00:36
  • Well, honestly, my answer is mostly unchanged. The cost of calling through a function pointer also varies wildly depending on CPU type. On older CPUs it was very rough, but on modern CPUs it is very much on par with conditional instructions; both use the branch predictor and if mispredicted become expensive. So again you would really need to test it on your target platform and see how the predictor holds up. – StilesCrisis Jan 10 '13 at 00:37
  • I should have been more specific in my original post; I'm interested in the cost of fetching a value via indirection, versus the cost of branching. – Philip Conrad Jan 10 '13 at 00:41
  • Tell you what. To avoid going in circles, post two samples (method A, method B) of your considered alternatives. – StilesCrisis Jan 10 '13 at 00:42
2

It varies from processor to processor, but depending on the data set you're working with, a pipeline flush caused by a mispredicted branch (or, in some cases, badly ordered instructions) can be more damaging to speed than a simple cache miss.

In the PowerPC case, for instance, a branch not taken (but predicted taken) costs about 22 cycles (the time taken to refill the pipeline), while an L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better not to branch and to let the processor cache-miss your data, at the cost of 3 cycles (branches predicted taken and taken) for every set of data you're processing.

It all boils down to: test it yourself. The answer is not definitive for all problems.

MVittiS
  • I understand that, and I've altered the question to reflect that there is no "general case". In particular, thank you for noting the PowerPC case. – Philip Conrad Jan 10 '13 at 00:58
0

Since the processor has to predict the outcome of the conditional in order to decide which instructions to speculatively execute, I would say that the actual cost of the instructions themselves is not what matters.

Conditional branches hurt efficiency because they make the instruction flow unpredictable.

Adrián