
I'm currently reading the book "Computer Systems: A Programmer's Perspective". I've learned that on the x86-64 architecture we are limited to 6 integer parameters that are passed to a function in registers; the remaining parameters are passed on the stack.

Also, the first (up to 8) FP or vector args are passed in xmm0..7.
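
For concreteness, here is a rough sketch of how a call with seven integer arguments and one double is lowered under the SysV x86-64 convention (the function `f` and the argument values are made up for illustration):

    extern  f                       ; hypothetical: long f(long,long,long,long,long,long,long, double)

    section .rodata
    one:    dq 1.0

    section .text
    call_f:
        mov     edi, 1              ; 1st integer arg -> rdi
        mov     esi, 2              ; 2nd -> rsi
        mov     edx, 3              ; 3rd -> rdx
        mov     ecx, 4              ; 4th -> rcx
        mov     r8d, 5              ; 5th -> r8
        mov     r9d, 6              ; 6th -> r9
        movsd   xmm0, [rel one]     ; 1st FP arg -> xmm0
        push    7                   ; 7th integer arg: no register left, so it goes on the stack
                                    ; (the single 8-byte push also restores 16-byte alignment here)
        call    f
        add     rsp, 8              ; caller removes its stack arg
        ret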

Why not use the floating-point registers to pass the remaining parameters, even when those parameters are not single/double-precision values?

It would be much more efficient (as far as I understand) to keep the data in registers than to store it to memory and then read it back from memory.

denis631
  • Seemingly, the [XMM* registers](https://en.wikipedia.org/wiki/X86_calling_conventions#Microsoft_x64_calling_convention) can similarly be used for passing, but only for floating point parameters. – StuartLC Nov 14 '15 at 10:12
  • @StuartLC why not store integer data in them, since the registers just store bits? They don't really know whether integer or floating-point data is stored. If I later move the values to integer registers like %rax, everything should be fine. I think this could be a performance boost, even though the code would be messy... – denis631 Nov 14 '15 at 10:19
  • At a guess, it would be more effort to encode / decode integers into floats and vice versa than passing on the stack. But hopefully an expert can answer :) – StuartLC Nov 14 '15 at 10:26
  • @StuartLC I do not want to convert integer to float and then back from float to int. The idea was to store the data without conversion, to avoid precision loss. If we stored a pointer address in a float register and conversion rounded the value of the address, we'd get a segmentation fault, which is not good :) – denis631 Nov 14 '15 at 10:39

2 Answers


Most functions don't have more than 6 integer parameters, so this is really a corner case. Passing excess integer params in xmm registers would make the rules for where to find floating-point args more complicated, for little to no benefit, and it probably wouldn't make code any faster anyway.

A further reason for storing excess parameters in memory is that the function probably won't use them all right away. If it wants to call another function, it has to save those parameters from xmm registers to memory, because the function it calls will destroy any parameter-passing registers. (And all the xmm regs are caller-saved anyway.) So you could potentially end up with code that stuffs parameters into vector registers where they can't be used directly, stores them to memory before calling another function, and only then loads them back into integer registers. Or even if the function doesn't call other functions, maybe it needs the vector registers for its own use, and would have to store params to memory to free them up for running vector code! It would have been easier just to push params onto the stack, because push is very heavily optimized, for obvious reasons, to do the store and the modification of RSP in a single uop, making it about as cheap as a mov.
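
To make that concrete, here is a sketch contrasting a hypothetical ABI, in which the 7th integer argument arrives in xmm2 (not the real convention), with the real stack-passing rule; `g` is a made-up external function the callee wants to call:

    extern  g

    ; Hypothetical ABI (NOT real SysV): suppose this function's 7th integer arg arrived in xmm2.
    callee_hypothetical:
        sub     rsp, 24             ; spill space (also keeps rsp 16-byte aligned for the call)
        movq    rax, xmm2           ; ALU uop just to get the "integer" arg out of the vector reg
        mov     [rsp], rax          ; ...and spill it, because g() clobbers all xmm registers
        call    g
        mov     rax, [rsp]          ; reload it afterwards
        add     rsp, 24
        ret

    ; Real ABI: the caller already stored the 7th arg with a cheap push;
    ; the callee just loads it from the stack when (and if) it needs it.
    callee_real:
        sub     rsp, 24
        call    g
        mov     rax, [rsp + 32]     ; 7th arg: 24 bytes of locals + 8-byte return address above rsp
        add     rsp, 24
        ret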

There is one integer register that is not used for parameter passing but is also not call-preserved in the SysV Linux/Mac x86-64 ABI: r11. It's useful to have a scratch register like that for lazy dynamic-linker code and similar wrapper functions to use without saving anything (since such shim functions need to pass all their args through to the dynamically-loaded function).
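
A rough sketch of such a shim (the pointer slot and names here are made up; real lazy-binding stubs are laid out differently): it must leave rdi..r9 and xmm0..7 untouched, because they hold the caller's arguments, but it can use r11 freely without saving it:

    section .text
    shim:
        mov     r11, [rel real_func_ptr]    ; load the real target (hypothetical pointer slot)
        jmp     r11                         ; tail-jump: all argument registers pass through untouched

    section .data
    real_func_ptr:  dq 0                    ; filled in elsewhere (hypothetical resolver)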

So AMD64 could have used more integer registers for function parameters, but only at the expense of the number of registers that called functions have to save before using. (Or dual-purpose r10 for languages that don't use a "static chain" pointer, or something.)

Anyway, more parameters passed in registers isn't always better.


xmm registers can't be used as pointer or index registers, and moving data from the xmm registers back to integer registers could slow down the surrounding code more than loading data that was just stored. (If any execution resource is going to be a bottleneck, rather than cache misses or branch mispredicts, it's more likely going to be ALU execution units, not load/store units. Moving data from xmm to gp registers takes an ALU uop, in Intel and AMD's current designs.)
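
To illustrate the first point: an address has to be in a general-purpose register before it can be dereferenced, so an integer value living in an xmm register needs an extra ALU uop before it can be used as a pointer (a small sketch):

    ; rdi holds a pointer: a GP register can be used directly in an addressing mode.
        mov     eax, [rdi]
    ; If the same pointer value lived in xmm0, it could not be dereferenced directly:
    ;   mov     eax, [xmm0]         ; invalid: xmm registers can't be base/index registers
    ; It first has to be moved to a GP register (one ALU uop on current Intel/AMD):
        movq    rdi, xmm0
        mov     eax, [rdi]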

L1 cache is really fast, and store->load forwarding makes the total latency for a round trip to memory something like 5 cycles on e.g. Intel Haswell. (The latency of an instruction like inc dword [mem] is 6 cycles, including the one ALU cycle.)

If moving data from xmm to gp registers was all you were going to do (with nothing else to keep the ALU execution units busy), then yes, on Intel CPUs the round trip latency for movd xmm0, eax / movd eax, xmm0 (2 cycles Intel Haswell) is less than the latency of mov [mem], eax / mov eax, [mem] (5 cycles Intel Haswell), but integer code usually isn't totally bottlenecked by latency the way FP code often is.
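
For reference, these are the two round trips being compared; each forms a dependency chain on eax, and the cycle counts in the comments are the Haswell numbers quoted above (a measurement sketch, not a benchmark):

    ; Round trip through an xmm register:
        movd    xmm0, eax           ; GP -> xmm
        movd    eax, xmm0           ; xmm -> GP: ~2 cycles total on Haswell
    ; Round trip through memory (store-forwarding), using the red zone as scratch:
        mov     [rsp - 8], eax      ; store
        mov     eax, [rsp - 8]      ; reload: ~5 cycles total on Haswell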

On AMD Bulldozer-family CPUs, where two integer cores share a vector/FP unit, moving data directly between GP regs and vector regs is actually quite slow (8 or 10 cycles one way, or half that on Steamroller). A memory round trip is only 8 cycles.

32bit code manages to run reasonably well, even though all parameters are passed on the stack, and have to be loaded. CPUs are very highly optimized for storing parameters onto the stack and then loading them again, because the crufty old 32bit ABI is still used for a lot of code, esp. on Windows. (Most Linux systems mostly run 64bit code, while most Windows desktop systems run a lot of 32bit code because so many Windows programs are only available as pre-compiled 32bit binaries.)
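
By way of comparison, a sketch of a call in the old 32-bit cdecl convention (`func32` is a made-up function taking three ints), where every argument makes that memory round trip:

    extern  func32

    ; 32-bit cdecl: every argument is pushed onto the stack, then reloaded by the callee.
        push    3                   ; 3rd arg pushed first
        push    2                   ; 2nd arg
        push    1                   ; 1st arg ends up at the lowest address
        call    func32              ; inside func32: args are at [esp+4], [esp+8], [esp+12]
        add     esp, 12             ; cdecl: the caller removes its own args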

See http://agner.org/optimize/ for CPU microarchitecture guides, to learn how to figure out how many cycles something will actually take. There are other good links in the x86 tag wiki, including the x86-64 ABI doc.

Peter Cordes
  • thanks for the detailed answer! I'm really shocked that "moving data from the xmm registers back to integer registers would be slower than loading data that was just stored." – denis631 Nov 14 '15 at 10:35
  • @denis631: actually that's only true on AMD Bulldozer-family. I should have said "not much faster", and it uses up ALU execution resources instead of load/store execution resources. – Peter Cordes Nov 14 '15 at 10:37
  • @denis631: updated to explain about using ALU execution units vs. load / store units, and latency vs. throughput. And updated again to point out that too many param-passing registers actually hurts when the called function wants to call other functions. – Peter Cordes Nov 14 '15 at 11:02
  • @denis631 - This is a good answer that explains the balance / pressure between the cost of passing by register vs caller / callee saving. I haven't seen any reference to it in this Q&A, but you might like to look at the [ABI documentation](http://www.x86-64.org/documentation.html). – Brett Hale Nov 14 '15 at 15:06

I think this isn't a good idea because:

  1. You can't use FPU/SSE registers as general-purpose registers (see the sketch after this list). I mean, this code isn't valid (NASM):

    mov byte[st0], 0xFF
    
  2. If you compare moving data to/from FPU/SSE registers with moving it between general-purpose registers and memory, FPU/SSE is very slow.
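
To expand on point 1: an xmm register can hold the bits of an integer, but the data can only move in and out through loads/stores or dedicated transfer instructions; the registers can't appear in addressing modes or ordinary integer instructions (a sketch):

    ; Valid ways to move integer bits in and out of an xmm register:
        movd    xmm0, eax           ; 32-bit GP -> xmm
        movq    xmm0, rax           ; 64-bit GP -> xmm
        movq    rax, xmm0           ; 64-bit xmm -> GP
        movq    xmm0, [rsp]         ; load 64 bits from memory
    ; But they can't be used like general-purpose registers:
    ;   mov     byte [st0], 0xFF    ; invalid, as noted above
    ;   add     xmm0, 1             ; invalid: no immediate integer add on an xmm register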

EDIT: Remember, I may not be right.

Still Dead
  • 2: loading/storing to/from xmm/mmx/fpu registers is at worst one cycle more latency than loading/storing to/from GP registers. Moving data directly between XMM and GP registers (with `movd` / `movq` / `pinsrw` / `pextrw`) is only one cycle latency on Intel, but several cycles of latency on AMD Bulldozer-family. (BD CPUs share a vector/FPU execution unit between a pair of integer cores. Each core has its own xmm/fpu architectural state, of course, but has to compete with the other core for execution resources.) AMD hadn't designed BD yet when they made the AMD64 ABI. – Peter Cordes Nov 14 '15 at 10:34
  • Anyway, instead of "very slow", just say "no faster than loading from memory". Your point #1 is spot-on, though. – Peter Cordes Nov 14 '15 at 10:35
  • Maybe. I didn't check loads/stores from/to FPU/SSE, but SSE wasn't really fast in my case. It was 5-6 times slower. Thanks for the correction – Still Dead Nov 14 '15 at 10:39
  • What was 5-6 times slower than what, and on what CPU? On modern Intel CPUs, 128bit SSE loads/stores have identical throughput to GP register loads/stores. (1 store and 2 loads per cycle on Haswell and later). The round-trip store->load latency for 128bit SSE stores (that don't cross a cache line boundary) is 6 cycles, vs. 5 for GP registers. (Intel SnB and Haswell). Numbers from Agner Fog's instruction tables. – Peter Cordes Nov 14 '15 at 10:50
  • An SSE implementation of vector (geometry) length. The FPU was 5-6 times faster than SSE in this case. But I'm a beginner, and I may have written incorrect or slower code; it's off-topic, I think. – Still Dead Nov 14 '15 at 10:56
  • I don't know what you mean by "SSE vector length realization". Those are all English words, but that phrase doesn't mean anything to me or google. If you did something where the x87 FPU was 5 times faster than your SSE implementation, you did something wrong in your SSE implementation. The memory round-trip latency for `fld [m64] / fstp [m64]` is 7 cycles on Intel Haswell, one higher than for 128b vectors. (But the same as for 256b vectors). You still didn't say what CPU you're talking about. That matters, because different CPUs have different internals... – Peter Cordes Nov 14 '15 at 11:06
  • (Eh... My English is bad. Sorry.) An SSE implementation of vector (geometry) length. CPU: Intel Penryn (so old, I know), and yes, that could be my mistake. – Still Dead Nov 14 '15 at 11:13