How do move elimination slots work in Intel CPU?

Question

Andreas Abel and Jan Reineke discuss move elimination in their paper describing uiCA:

4.1.4 Move Elimination. [...] However, this move elimination is not always successful. [...] We have developed microbenchmarks that use these counters to analyze when move elimination is successful. [...]
The following model agrees with our observations. The processor keeps track of the physical registers that are used by more than one architectural register. We say that each such physical register occupies one elimination slot. An elimination slot is released again after the corresponding registers have been overwritten.* The number of move instructions that can be eliminated in a cycle depends both on the number of available elimination slots, and on the number of successful eliminations in the previous cycle.

Where I've added emphasis on the part I don't understand.

I thought that a given physical register could be used from rename to retire only by a single architectural register. I took the meaning of the text to imply otherwise and so I'm struggling to understand how move elimination slots work (and at this point even how register renaming actually works).

I just edited some new stuff into my answer, after discussion with BeeOnRope about FLAGS also being references to PRF entries. Ping since SO doesn't notify you otherwise. — Peter Cordes, Jan 24 '23 at 15:07

Peter Cordes · Accepted Answer · 2023-01-24T15:06:04.813

The whole point of mov-elimination is that instead of allocating a new PRF entry (physical register file) and running a uop to read the value and write it to that new entry (like lea rdx, [rcx+0] would), mov rdx, rcx can be handled by having the RAT entry (register allocation table) for RDX point to the same physical register number as RCX does at that point.
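To make the RAT aliasing concrete, here is a minimal toy model in Python. It is only a sketch: the `Renamer` class and its method names are made up for illustration, and real hardware also tracks freeing, retirement state and much more.

```python
# Toy model of register renaming; an illustration, not a description of real hardware.
class Renamer:
    def __init__(self, arch_regs):
        self.next_preg = 0
        self.rat = {}  # architectural register name -> physical register number
        for reg in arch_regs:
            self.rat[reg] = self._alloc()

    def _alloc(self):
        # Hand out the next unused physical register number (never freed in this toy).
        preg = self.next_preg
        self.next_preg += 1
        return preg

    def write(self, dst):
        """A normal uop that writes dst (e.g. lea rdx, [rcx+0]) gets a fresh PRF entry."""
        self.rat[dst] = self._alloc()
        return self.rat[dst]

    def mov_eliminated(self, dst, src):
        """Eliminated mov: no uop and no new PRF entry; dst just aliases src's entry."""
        self.rat[dst] = self.rat[src]
        return self.rat[dst]


r = Renamer(["rcx", "rdx"])

r.write("rdx")                                # lea rdx, [rcx+0]
lea_is_alias = r.rat["rdx"] == r.rat["rcx"]   # separate PRF entries

r.mov_eliminated("rdx", "rcx")                # mov rdx, rcx
mov_is_alias = r.rat["rdx"] == r.rat["rcx"]   # both names map to one PRF entry
```

After the `lea`-style write the two architectural registers map to different physical registers; after the eliminated `mov` they map to the same one, which is exactly the aliasing the question asks about.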

So the whole idea is to bend the rule of a PRF entry being the state of a single architectural register at some point. This presumably makes it more complicated to track when a PRF entry can be freed, or for renaming later uops when two architectural registers both refer to the same physical reg, or some other complication.

"Move-elimination slots" are a separate resource, not PRF entries. They exist to solve whatever extra tracking problem Intel ran into. A move-elimination slot is freed when you overwrite the destination of the mov again later, e.g. mov ecx, edx / not ecx immediately releases whatever mov-elimination resources were needed.

Without mov-elimination, you're right about how it works; one PRF entry holds the value written to only one architectural register, and is an input dependency for any uops that read that register before it's overwritten.

Except that a PRF entry also has room for FLAGS condition codes, so after an instruction like add eax, ecx that writes both FLAGS and an integer reg, both FLAGS and RAX point to the same physical reg. A later instruction like mov-immediate, not or lea can overwrite the gp register and leave just CF and the SPAZO group of FLAGS pointing to the old physical reg. Instructions like cmp, stc, or add [mem], eax write (part of) FLAGS but not an integer register.

But that's just two things (the separately-renamed parts of FLAGS, CF and SF/PF/AF/ZF/OF aka SPAZO) which can maybe still refer to a phys reg, other than a GP-integer register. With maybe 1 bit per phys reg to track whether it's still referenced by a GP-integer reg, retirement can free them correctly when retiring a uop that writes a GP-integer register, with maybe just a check against the retirement state of the RAT entries for FLAGS. Or maybe each PRF-entry has 3 bits, one each for GP-integer, CF, and SPAZO, as a way for retirement to figure out when it can free a physical register (when it retires a uop that overwrites the last architectural reference to it.)
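The "3 bits per PRF entry" guess can be sketched as a toy Python model. This is speculation turned into code, not a known Intel design: the `FlagAwarePRF` class, its `retire_write` method and the bit names are hypothetical, and mov-elimination is deliberately left out so a physical register has at most one GP-integer reference.

```python
class FlagAwarePRF:
    """Toy model: each preg has up to 3 'still referenced' bits: gp, cf, spazo."""

    def __init__(self):
        self.bits = {}    # preg -> set of reference kinds still pointing at it
        self.owner = {}   # architectural name ("eax", "cf", "spazo") -> preg
        self.next_preg = 0
        self.freed = []   # pregs released so far, in order

    def retire_write(self, gp_dst=None, writes_cf=False, writes_spazo=False):
        """Retire a uop that writes a GP register and/or parts of FLAGS."""
        preg = self.next_preg
        self.next_preg += 1
        self.bits[preg] = set()

        targets = []
        if gp_dst is not None:
            targets.append((gp_dst, "gp"))
        if writes_cf:
            targets.append(("cf", "cf"))
        if writes_spazo:
            targets.append(("spazo", "spazo"))

        for name, kind in targets:
            old = self.owner.get(name)
            if old is not None:
                self.bits[old].discard(kind)   # old preg loses this reference
                if not self.bits[old]:         # last reference gone: free it
                    del self.bits[old]
                    self.freed.append(old)
            self.owner[name] = preg
            self.bits[preg].add(kind)
        return preg


prf = FlagAwarePRF()
p_add = prf.retire_write("eax", writes_cf=True, writes_spazo=True)  # add eax, ecx
p_not = prf.retire_write("eax")                                     # not eax (no FLAGS write)
freed_after_not = list(prf.freed)   # add's preg is still referenced by CF and SPAZO
p_cmp = prf.retire_write(None, writes_cf=True, writes_spazo=True)   # cmp edx, ebx
freed_after_cmp = list(prf.freed)   # now nothing references add's preg
```

In this sketch the preg written by `add` survives the `not` because FLAGS still points at it, and is only freed once `cmp` replaces both CF and SPAZO.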

BeeOnRope suggests that instead of full reference-counting in every PRF entry (with counters that could count up to 15 in case of mov ecx, eax / mov edx, eax / ...), the move-elimination slots effectively are reference counts.
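A minimal Python sketch of that idea, following the uiCA wording that a physical register shared by more than one architectural register occupies one slot. Everything here is an assumption for illustration: the `SlotModel` class, the slot count passed in, and the simplification of ignoring FLAGS, retirement timing and the "eliminations in the previous cycle" limit that uiCA also describes.

```python
class SlotModel:
    """Toy model: a preg shared by 2+ architectural registers occupies one slot."""

    def __init__(self, arch_regs, num_slots):
        self.rat = {reg: i for i, reg in enumerate(arch_regs)}  # arch reg -> preg
        self.next_preg = len(arch_regs)
        self.num_slots = num_slots   # hypothetical capacity, not a measured value
        self.slots = {}              # shared preg -> number of arch regs aliased to it

    def _drop_ref(self, preg):
        # An architectural register stopped pointing at preg.
        if preg in self.slots:
            self.slots[preg] -= 1
            if self.slots[preg] <= 1:   # back to a single owner: release the slot
                del self.slots[preg]

    def write(self, dst):
        """A normal uop writing dst: fresh preg, old mapping is dropped."""
        self._drop_ref(self.rat[dst])
        self.rat[dst] = self.next_preg
        self.next_preg += 1

    def mov(self, dst, src):
        """Return True if the mov was eliminated, False if it needed a real uop."""
        src_preg = self.rat[src]
        if self.rat[dst] == src_preg:
            return True                  # dst already aliases src
        if src_preg not in self.slots and len(self.slots) >= self.num_slots:
            self.write(dst)              # no free slot: handled like a normal uop
            return False
        self._drop_ref(self.rat[dst])
        # The refs == 1 case is untracked; the second alias allocates the slot.
        self.slots[src_preg] = self.slots.get(src_preg, 1) + 1
        self.rat[dst] = src_preg
        return True


m = SlotModel(["eax", "ecx", "edx", "ebx"], num_slots=1)
first = m.mov("ecx", "edx")    # eliminated, takes the only slot
second = m.mov("ebx", "eax")   # no slot left, so not eliminated
m.write("ecx")                 # e.g. not ecx: overwrites the copy, slot released
third = m.mov("ebx", "eax")    # eliminated again
```

With a single slot, the second `mov` fails to eliminate until the first copy's destination is overwritten, which matches the advice to overwrite the result of a register copy soon.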

xor-zeroing can always be eliminated because the physical zero-register never needs to be freed, so it doesn't need to be reference counted. (The existence of a physical zero-register for integer and vector is inferred from the fact that SnB-family is able to eliminate zeroing idioms as no uops.)

Related: Can x86's MOV really be "free"? Why can't I reproduce this at all? which mentions some of what Intel's optimization manual says about preferring to overwrite the result of a register copy soon, to increase the success rate of mov-elimination. But Intel at least at that time didn't mention the details of what CPU resource limit was involved.

Skylake has more mov-elimination slots than Ivy Bridge, since my testing shows it doesn't run into a bottleneck in the test-case they used to illustrate the benefit of overwriting the mov promptly.

It's really unfortunate that Intel screwed up Ice Lake / Tiger Lake and had to disable its mov-elimination (for GP-integer) with a microcode update, since overwriting the mov right away usually means it's part of the critical path latency, the opposite of what you want if your code might run on a CPU without mov-elimination. It's working again in Alder Lake and Rocket Lake.

In many cases you will overwrite both the copy and the original soonish, so it's fine to leave the destination unmodified across a few instructions. Ideally avoid leaving the copy unmodified long-term, unless it would cost more uops or make the critical path latencies worse on Ice Lake. (e.g. if you save a copy and only ever read it.) The next interrupt will usually lead to all regs getting saved/restored anyway so this isn't a problem that can "build up" even for code that has a few long-running loops with many mov-eliminated copies.

Yeah, move elimination slots represent the resources in the table needed to track which arch registers map to the same physical register after a move elimination has occurred. At a minimum it needs to be consulted at retire to see if it's safe to free the preg or whether it is still referenced. There is also some complication with flags: as these are associated with pregs, but a move elimination applies only to the non-flag part, this table must remember which of the arch regs owns the associated flags, or something like that. — BeeOnRope, Jan 23 '23 at 02:18
@BeeOnRope: I guess tracking in a separate table was cheaper than reference-counting each preg? Or some other way of figuring out "no architectural state in the current retirement state maps to this preg". Since any older uops have already read the value, younger uops (or a rollback) can only possibly read it if a GP integer reg or EFLAGS references it. Mov-elim doesn't work or isn't used by multi-uop insns (like `xchg`), so extra "architectural-ish" registers for microcode use don't need to participate in this check. — Peter Cordes, Jan 23 '23 at 04:38
@BeeOnRope: Given the FLAGS vs. integer value thing, even without mov-elim, it's not as simple as having retirement free the old preg when retiring a uop that wrote a register. Although with just FLAGS as the only other possible reference, that's presumably cheap enough to just check with a small integer comparator even for 8/clock (4/logical core) retire, or 12 / 6 on Alder Lake. But checking all 16 integer regs + FLAGS would add up — Peter Cordes, Jan 23 '23 at 04:42
As I understand it, this other table _is_ effectively a reference count, but they want to treat the refs == 1 case (where no move elimination has occurred) separately from all the other cases, so they can avoid allocating the resources to do the reference counting in the first place since those are limited. So the second register which gets aliased to one preg triggers allocation in this table. At least this is what I understand from my reading. — BeeOnRope, Jan 23 '23 at 14:06
_Given the FLAGS vs. integer value thing, even without mov-elim, it's not as simple as having retirement free the old preg when retiring a uop that wrote a register._ Yes exactly, retirement needs to be aware of this even without move elimination, but it adds complexity to move elimination too: for example a plain reference count may not work because you need to answer the question of whether any of the remaining references to this preg are the ones that came along with flags. Since `mov` doesn't copy flags, I guess only the original owner of the preg would have associated flags. — BeeOnRope, Jan 23 '23 at 14:10
