x86 rep prefix with a count of zero: what happens?

Question

What happens for an initial count of zero for an x86 rep prefix?

Intel's manual says explicitly it’s a while count != 0 loop with the test at the top, which is the sane expected behaviour.

But most of the many vague reports I’ve seen elsewhere suggest that there’s no initial test for zero, so it would behave like a countdown with the test at the end. A zero count would then be a disaster: something like repeat { …; count -= 1; } until count == 0; or who knows what.

I put in an explicit conditional branch at the top and I would love to remove it if I can. — Cecil Ward, Jun 01 '23 at 05:39
z80 had no initial test for its string instructions, maybe you heard it in that context? — harold, Jun 01 '23 at 09:08
I’m an old Z80 professional asm programmer. So maybe fear has remained in my heart. — Cecil Ward, Jun 01 '23 at 16:15

Peter Cordes · Accepted Answer · 2023-06-01T14:24:49.830

Nothing happens with RCX=0; rep prefixes do check for zero first like the pseudocode says. (Unlike the loop instruction which is exactly like the bottom of a do{}while(--ecx), or a dec rcx/jnz but without affecting FLAGS.)
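To make the difference concrete, here is a minimal C sketch that models the two behaviours in software. It is not real hardware and not Intel's pseudocode; the function names `model_rep_stosb` and `model_loop_insn_iterations` are made up for illustration, and the `loop` model assumes a 16-bit counter (CX) only so that the wraparound case is cheap to run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of `rep stosb`: the count is tested at the top,
   so a count of zero performs no stores at all.
   Returns the number of bytes stored. */
size_t model_rep_stosb(uint8_t *dst, uint8_t al, uint64_t rcx)
{
    size_t stores = 0;
    /* With a zero count the pointer is never dereferenced, so even
       NULL is acceptable in that case. */
    assert(dst != NULL || rcx == 0);
    while (rcx != 0) {
        dst[stores++] = al;
        rcx--;
    }
    return stores;
}

/* Hypothetical model of a loop body closed by the `loop` instruction:
   the decrement and test sit at the bottom, like do{}while(--cx).
   Assumes a 16-bit counter, so starting from 0 wraps to 0xFFFF.
   Returns how many times the body ran. */
uint32_t model_loop_insn_iterations(uint16_t cx)
{
    uint32_t iterations = 0;
    do {
        iterations++;          /* stands in for the loop body */
    } while (--cx != 0);
    return iterations;
}
```

In this model, `model_rep_stosb(buf, 0xAA, 0)` stores nothing, while `model_loop_insn_iterations(0)` runs the body 65536 times. The second case is the behaviour the question was worried about, and it belongs to `loop`, not to the rep prefixes.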

I think I've heard of this occasionally (but rarely) being used as an idiom for a conditional load or store: rep lodsw or rep stosw with a count of 0 or 1, especially in the bad old days before cmov. (cmov is an unconditional load feeding an ALU select operation, so it needs a valid address, unlike rep lods with a count of zero.) This is not efficient, especially for rep stos on modern x86 with Fast Strings microcode (P6 and later), and especially without anything like Fast Short Rep-Movs (Ice Lake, IIRC).

The same applies to instructions that treat the prefixes as repz / repnz (cmps/scas) instead of unconditional rep (lods/stos/movs). Doing zero iterations means they leave FLAGS unmodified.

If you want to check FLAGS after a repe/ne cmps/scas, you need to make sure the count was non-zero, or that FLAGS was already set such that you'll branch in a useful way for zero-length buffers. (Perhaps from xor-zeroing a register that you're going to want later.)
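The FLAGS gotcha can be sketched with a small C model, again as an illustration only and not a description of any real microcode. Here `*zf` stands in for the Zero Flag, and the names `model_repe_cmpsb` and `model_buffers_equal` are hypothetical helpers invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of `repe cmpsb`. *zf carries whatever value the
   caller left in the Zero Flag; it is only written if at least one
   comparison actually executes. Returns the remaining count. */
uint64_t model_repe_cmpsb(const uint8_t *a, const uint8_t *b,
                          uint64_t rcx, bool *zf)
{
    size_t i = 0;
    assert(zf != NULL);
    while (rcx != 0) {           /* count tested first: 0 means no compare */
        *zf = (a[i] == b[i]);    /* cmpsb sets ZF from the comparison */
        i++;
        rcx--;
        if (!*zf)                /* repe stops on the first mismatch */
            break;
    }
    return rcx;
}

/* Hypothetical wrapper that is safe for zero-length buffers: ZF is
   pre-set to 1 first, as xor-zeroing a register would do. */
bool model_buffers_equal(const uint8_t *a, const uint8_t *b, uint64_t len)
{
    bool zf = true;              /* pre-initialised "equal" result */
    model_repe_cmpsb(a, b, len, &zf);
    return zf;
}
```

With a count of zero, the model returns with `*zf` exactly as the caller left it, so a branch on it afterwards reflects stale state unless it was pre-set. `model_buffers_equal` pre-sets it so that two empty buffers compare as equal.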

rep movs and rep stos have fast-strings microcode on CPUs since P6, but the startup overhead makes them rarely worth it, especially when sizes can be short and/or data might be misaligned. They're more useful in kernel code where you can't freely use XMM registers. Some recent CPUs like Ice Lake have fast-short-rep microcode that I think is supposed to reduce startup overhead for small counts.

repe/ne scas/cmps do not have fast-strings microcode on most CPUs, only on very recent CPUs like Sapphire Rapids and maybe Alder Lake P-cores. So they're quite slow, like one load per clock cycle (so 2 cycles per count for cmpsb/w/d/q) according to testing by https://agner.org/optimize/ and https://uops.info/.

What setup does REP do?
Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled? - GCC -O1 used to use repne scasb to inline strlen. This is a disaster for long strings.
Which processors support "Fast Short REP CMPSB and SCASB" (very recent feature)
Enhanced REP MOVSB for memcpy - even without ERMSB, rep movs will use no-RFO stores for large sizes, similar to NT stores but not bypassing the cache. Good general Q&A about memory bandwidth considerations.

`repe` and `repne` both work the same way, if the counter is initially zero then the repeated instructions are skipped without any effect. **Importantly**, the flags are not modified by a `cmps` or `scas` instruction skipped in this way. So if you want to branch on eg the Zero Flag then you have to make sure that the counter is never zero initially, or that the Zero Flag is pre-initialised the way you want, eg in https://hg.pushbx.org/ecm/ldebug/file/eb3730a8118e/source/expr.asm#l3505 — ecm, Jun 01 '23 at 06:17
@ecm: Thanks, I think the FLAGS-unmodified possible gotcha was what I was trying to remember as being "special" about a count of zero for rep[n]e cmps/scas. Update. — Peter Cordes, Jun 01 '23 at 14:25
Strange, I could have sworn that `rep` when `cx=0` didn't terminate immediately but rather wrapped around to `ffff...` and had to reach zero again. — puppydrum64, Jun 29 '23 at 17:29
@puppydrum64 No, the `rep` type prefixes do "check" for a zero counter initially. This is why the pseudo-code in https://pushbx.org/ecm/doc/insref.htm#insSCASB and the other string instructions start with `jcxz` (for `a16`) before the loop, instead of only including a `loop` branch. Of course that isn't a good source because I added those examples, but I believe the vendor manuals by AMD or Intel would confirm this. Or try it in a debugger. — ecm, Jun 30 '23 at 16:33
Chances are I'm also getting x86 and Z80 mixed up, I tend to do that very often. — puppydrum64, Jul 20 '23 at 13:11
