Using RSI/RDI vs r8-r15 (speed optimization)

Question

I want to optimize my function as much as possible, and one of the things I did was use r8 as a pointer because that's the register the pointer gets passed in for x64 functions.

But would pushing RSI or RDI, moving the pointer to them and using them be faster later in a loop?

For example:

mov [RSI],DL ; would compile to 2 bytes
mov [r8],DL  ; would compile to 3 bytes

So if I did a loop 100 to 200 times would r8 be slower because of the extra byte to decode? Or does pushing RSI and moving the pointer eliminate any possible speed increase? Obviously the push and mov would happen outside the loop.

The effect of this supposed optimization would be negligible. I suggest to concentrate on other aspects. — zx485, Jul 19 '18 at 22:21
In most cases the difference from the extra opcode byte will probably be unmeasurable (cache saves). You would have to hit many other constraints and corner cases, and have some kind of complex (longer) code inside the loop, to make that extra byte significant in the profiling data. In 90+% of cases you will bottleneck earlier on something else. As always with performance tuning, write a correct version first, measure it well, then evaluate the results to see what the actual bottleneck is. It's highly likely you will find other aspects worth tuning that will yield far greater gains. — Ped7g, Jul 20 '18 at 08:36

Peter Cordes · Accepted Answer · 2018-07-20T08:50:36.823

Depends on the CPU. Usually an average instruction size of 4 bytes is fine to avoid front-end bottlenecks, even on old CPUs like Core2.

Modern CPUs like Sandybridge-family and Ryzen cache decoded uops and are less sensitive to code-size (or alignment) inside loops. There, code-size only matters on the large scale, for L1i and uop-cache footprint.

Nehalem has a "loop buffer" for small loops up to 28 uops. (SnB family has this too, except Skylake/Kaby Lake where it's disabled by a microcode update, so they run even small loops from the uop cache.) Core2 has a pre-decode loop buffer for up to 64 bytes. (See Agner Fog's guides.)

But yes, in general higher code density is better, so favour non-REX registers for pointers and 32-bit values, using r8-r15 for 64-bit integers that always need a REX.W anyway. But usually not worth spending extra instructions to make this happen. uop count is usually a bigger deal than code-size, especially inside a loop.
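As a rough illustration of the code-size point, here are a few encodings in NASM syntax. The byte values in the comments are the standard x86-64 encodings; the particular instructions are just examples chosen for this sketch, not taken from the question's function.

```nasm
; Byte stores through a pointer: a REX prefix is needed only for r8-r15.
mov  [rsi], dl        ; 88 16        2 bytes
mov  [r8],  dl        ; 41 88 10     3 bytes (REX.B selects r8)

; 32-bit operations: no REX prefix with the legacy registers.
add  eax, ecx         ; 01 C8        2 bytes
add  r8d, ecx         ; 41 01 C8     3 bytes

; 64-bit operations need REX.W regardless, so r8-r15 cost nothing extra.
add  rax, rcx         ; 48 01 C8     3 bytes
add  r8,  rcx         ; 49 01 C8     3 bytes
```

So keeping 64-bit integers in r8-r15 and pointers or 32-bit values in the legacy registers tends to minimize the number of REX prefixes, when register allocation allows it without extra instructions.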

Profile with performance counters to find out if there are any front-end bottlenecks in your loop. If so, sure, saving/restoring some more low regs like RBP and using them instead of R8 inside your function can be useful. (But remember that [rbp] as an addressing mode actually needs a disp8=0, i.e. it is encoded as [rbp+0].)
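A minimal sketch of the save/restore approach, assuming the Windows x64 calling convention (which the question seems to describe): r8 holds the pointer argument and rsi is call-preserved. The function name `fill_bytes` and its argument layout are hypothetical, made up for this example.

```nasm
; Hypothetical function, assuming the Windows x64 convention:
;   rcx = byte count (assumed nonzero), dl = byte value, r8 = destination
; rsi is call-preserved there, so it has to be saved and restored.
fill_bytes:
    push rsi              ; 56        1 byte, outside the loop
    mov  rsi, r8          ; 4C 89 C6  3 bytes, outside the loop
.loop:
    mov  [rsi], dl        ; 88 16     2 bytes (would be 3 with [r8])
    inc  rsi              ; 64-bit inc needs REX.W with any register
    dec  rcx
    jnz  .loop
    pop  rsi              ; 5E        1 byte
    ret
```

This saves one byte inside the loop at the cost of three extra instructions (five bytes) outside it, so it is only likely to pay off if profiling shows a front-end bottleneck in the loop. Note also that using rbp as the base would not help here: `mov [rbp], dl` encodes as 88 55 00, which is 3 bytes because of the mandatory disp8, the same size as the r8 version.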
