
Starting with the 32-bit CPU mode, the x86 architecture offers extended addressing operands: one can specify a base register, a displacement, an index register, and a scaling factor.

For example, suppose we would like to stride through a list of 32-bit integers, reading the first two from each 32-byte-long data structure in an array (%rdi as the data index, %rbx as the base pointer).

addq   $8, %rdi                # skip eight values: advance index by 8
movl   (%rbx, %rdi, 4), %eax   # load data: base + scaled index
movl   4(%rbx, %rdi, 4), %edx  # load data: base + scaled index + displacement

As far as I know, such complex addressing fits into a single machine-code instruction. But what is the cost of such an operation, and how does it compare to simple addressing with independent pointer calculation:

addq  $32, %rbx      # skip eight values: move pointer forward by 32 bytes
movl  (%rbx), %eax   # load data: pointer
addq  $4, %rbx       # point to next value: move pointer forward by 4 bytes
movl  (%rbx), %edx   # load data: pointer

In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, I gain simpler address operands, and there are no multiplications any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication comes down to a bit shift, which is also a very fast operation. Still, two additions and a bit shift can be replaced with one addition.

What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?

Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?


Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each assembly instruction, its execution time in clock cycles, or a dependency graph. There are thousands of assembly freaks who would benefit from such an application, so I bet something like this already exists!

Krzysztof Abramowicz
  • General answer #0: Optimization is voodoo, and things like adding instructions or using longer instructions can speed things up in some cases. These behaviors can vary from CPU to CPU; something true on one model may not be true on a newer model. In your case, things can go either way, and there is no good way to predict without simply measuring. – Nayuki Aug 31 '13 at 21:00
  • General answer #1: http://www.agner.org/optimize/ ; http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html – Nayuki Aug 31 '13 at 21:01
  • @NayukiMinase, some v useful links. Very worth a browse. Thanks. – TerryE Sep 01 '13 at 10:35
  • @NayukiMinase: optimization is not voodoo, though it may appear to be to someone uninformed examining generated assembly code. Neither is it true that "there is no good way to predict without simply measuring." Quality compilers do a very good job of predicting without measuring - simply or otherwise. – Olof Forshell Sep 02 '13 at 18:38
  • I know of no such editor as you are asking about. A very complete sampling tool is available from AMD: CodeAnalyst for 32- and 64-bit x86. – Olof Forshell Sep 06 '13 at 09:59
  • One "instruction", yes, but these are variable-length instructions; a similarly interesting question is the size of the instructions for the task as a whole - how much fetching is needed. The math is fast (one clock), so that is not an issue. Examine decisions, when they happen, additional fetches, etc. Even with that, as mentioned above, performance is specific to the processor architecture and system. – old_timer Sep 14 '13 at 17:20

2 Answers


Address arithmetic is very fast and should always be used when possible.

But here is something that the question misses.

First, you can't multiply by 32 using address arithmetic - 8 is the maximum possible scaling factor.

The first version of the code is not complete, because it needs a second instruction to increment rbx. So, we have the following two variants:

inc  rbx          
mov  eax, [8*rbx+rdi]

vs

add  rbx, 8
mov  eax, [rbx]

This way, the speed of the two variants will be the same, and the size is the same as well - 6 bytes for each.

So, which code is better depends only on the program context: if we have a register that already contains the address of the needed array cell, use mov eax, [rbx].

If we have a register containing the index of the cell and another containing the start address, then use the first variant. This way, after the algorithm ends, we will still have the start address of the array in rdi.

johnfound
  • As I understand it, the complex address operand (and the circuitry lying behind it) was introduced to eliminate the need for the accessory, address-calculating instructions, thus (in many cases) compacting the code and reducing instruction-level dependencies. So, unless I already have a ready pointer, the complex addressing in usually preferable. – Krzysztof Abramowicz Sep 05 '13 at 12:02
  • @KrzysztofAbramowicz Exactly. Don't forget to vote and accept the answer you like. ;) – johnfound Sep 05 '13 at 12:08
  • As for the multiplication, there is no mistake, but the example introduces an (unnecessary) assumption that the `%rdi` strides 8 units per iteration, so the scaling factor is 4, but the data is not accessed linearly. It would have been obvious, if I hadn't forgotten to provide the loop counter advancement instruction for the first variant. I will apply your remarks to my question. – Krzysztof Abramowicz Sep 05 '13 at 12:23

The answer to your question depends on the given, local program flow circumstances - and they, in turn, may vary somewhat between processor manufacturers and architectures. Microanalyzing a single instruction or two is usually pointless. You have a multi-stage pipeline, more than one integer unit, caches, and lots more that come into play and that you need to factor into your analysis.

You could try reverse-engineering by looking at generated assembly code and analyzing why the sequence looks the way it does with respect to the different hardware units that will be working on it.

Another way is to use a profiler and experiment with different constructs to see what works well and what doesn't.

You could also download the source code for gcc and see how the really cool programmers evaluate a sequence to produce the fastest possible code. One day you might become one of them :-)

In any case I expect that you will come to the conclusion that the best sequence varies greatly depending on processor, compiler, optimization level and surrounding instructions. If you are using C the quality of the source code is extremely important: garbage in = garbage out.

Olof Forshell
  • You're right – I should have provided broader context for my example. My intent was to focus on the performance of instruction-level address arithmetic (carried out by a dedicated unit, at a different stage of the pipeline, right?) and its comparison to independent, ALU-based calculations executed for the same purpose. Nevertheless, the question came out of a wider issue, which I'm going to reveal through an edit. – Krzysztof Abramowicz Sep 05 '13 at 11:29
  • The context for your comparison is what the respective units have been up to right up to the point where your instructions are executed. A good compiler producing optimized code will have kept track of which units in the pipeline it has put to work on this or that instruction and also correctly estimate when that work completes. It will try different instruction scenarios to find the one that works best (runs fastest). If you choose a restricted context it will be of little practical use. – Olof Forshell Sep 06 '13 at 09:50
  • One place where testing such as the one (I think) you are proposing is useful is to determine ALU or FPU efficiency when comparing different processors/architectures. You might want to check out this document (http://gmplib.org/~tege/x86-timing.pdf) which shows the rate at which certain x86-architectures can process individual instructions and their degree of parallelism. – Olof Forshell Sep 06 '13 at 09:55