How exactly does the x86 LOOP instruction work?

Question

            mov    ecx, 16
looptop:    .
            .
            .
            loop looptop

How many times will this loop execute?

What happens if ecx = 0 to start with? Does loop jump or fall-through in that case?

What register is used as your loop counter? (hint: it looks to hold the immediate value `16`) — David C. Rankin, Oct 23 '17 at 02:53
Second hint, `CX` is known as the *count register*. (`ecx` is just the 32-bit version -- it is used to store the loop count in iterative operations, and decremented by `1` each iteration) — David C. Rankin, Oct 23 '17 at 02:55
For such simple code you can even use http://carlosrafaelgn.com.br/asm86/ (it is not a complete or perfect x86 emulator, with several bugs and missing instructions, but it is good enough to do *this*). Open the window with registers and single-step over the code to see how it works (probably put `test ecx,ecx` or *anything* inside the loop to better see where the execution goes). — Ped7g, Oct 23 '17 at 02:57
And if you had mov ecx, 0 then it would loop through 0 times? Or would the code break? — Hannah Duncan, Oct 23 '17 at 02:59
Right for `16`. There are many ways to do a loop. The `loop` instruction simply jumps to the label that follows (e.g. `looptop`), `ecx` number of times decrementing `ecx` by `1` each time until it is `0` and then continues with the next instruction following `loop looptop` `:)` — David C. Rankin, Oct 23 '17 at 03:01
Why would 0 loop zero times? You can't loop zero times: the CPU doesn't foresee the `loop` instruction and somehow skip the loop body (the CPU cares only about the current instruction and its current state, nothing else). As this is a `do { ... } while()` type of loop, it will execute at least once. For `ecx=1` it will execute exactly once. For `ecx=0` the next value is of course `4294967295`, if you understand how binary math works, and why 0-1 will set all 32 bits to ones (i.e. `0xFFFFFFFF == 4294967295`). Don't use human fuzzy logic, do machine-accurate calculation. It's a calculator, nothing more. — Ped7g, Oct 23 '17 at 03:04
I do understand binary, my professor just isn't that great at explaining things so I'm still fuzzy on loops. So if ecx = 0, it'll run through 4294967296 total times cause it has to loop through once to go from 0000 0000 to FFFF FFFF. — Hannah Duncan, Oct 23 '17 at 03:18
Yes, `loop` is exactly like `dec ecx / jnz`, except it doesn't set flags. Or in C, it's like the bottom of a `do{} while(--ecx != 0);` loop. If you ever want to know the details on an instruction, check the manual: http://felixcloutier.com/x86/LOOP:LOOPcc.html. And you can (and should) just try stuff in a debugger: single-step and watch registers change. See also https://stackoverflow.com/tags/x86/info for links to guides (and asm debugging tips at the bottom.) — Peter Cordes, Oct 23 '17 at 03:21
Thank you so much! I've been using a debugger to try and figure it out, but just didn't step through enough times to see that the loop did run through a finite number of times. It makes much more sense now! — Hannah Duncan, Oct 23 '17 at 03:24
And BTW, if the instructions that aren't shown modify `ecx`, it could loop any number of times. For the question to have a simple and unique answer, you need a guarantee that the instructions between the label and the `loop` instruction don't modify `ecx`. (They could save/restore it, but if you're going to do that it's usually better to just use a different register as the loop counter. You should [normally never use the `loop` instruction unless optimizing for code-size anyway](https://stackoverflow.com/q/35742570/224132), because it's slow. Compilers don't use it.) — Peter Cordes, Oct 23 '17 at 03:26
Tip for next time you have a similar problem: use a smaller constant so you don't have to step as many times to get to the interesting values (ecx=1). — Peter Cordes, Oct 23 '17 at 03:27
I changed your question into one that doesn't need to be downvoted / deleted. Hope that's ok. Your follow-up questions in comments basically amounted to this, but the actual question as written was too specific for such a simple "look it up in the manual" question. — Peter Cordes, Oct 23 '17 at 04:23

score 33 · Accepted Answer · edited Sep 28 '21 at 16:27

loop is exactly like dec ecx / jnz, except it doesn't set flags.

It's like the bottom of a do {} while(--ecx != 0); in C. If execution enters the loop with ecx = 0, wrap-around means the loop will run 2^32 times. (Or 2^64 times in 64-bit mode, because it uses RCX.)
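As a rough C model of the behaviour described above (a sketch of the semantics, not of how any CPU implements it): the first function counts body executions for a 16-bit counter, i.e. loop using CX, so that the wrap-around case finishes quickly. The second applies the same rule to ECX without iterating. Both function names are made up for this illustration, and both assume the loop body never touches the counter.

```c
#include <stdint.h>

/* Hypothetical helper: models `loop` with a 16-bit counter (CX).
   Returns how many times the loop body runs for a given starting CX,
   assuming the body itself never modifies CX. */
unsigned long loop_body_runs_cx(uint16_t cx)
{
    unsigned long runs = 0;
    do {
        runs++;                     /* the loop body goes here */
        cx = (uint16_t)(cx - 1);    /* decrement first (0 wraps to 0xFFFF) */
    } while (cx != 0);              /* then test: jump back while non-zero */
    return runs;
}

/* Same rule for a 32-bit counter (ECX), computed directly instead of
   iterating: starting at 0 wraps to 0xFFFFFFFF, giving 2^32 runs. */
uint64_t loop_body_runs_ecx(uint32_t ecx)
{
    return ecx == 0 ? (UINT64_C(1) << 32) : (uint64_t)ecx;
}
```

Starting from 16 the body runs 16 times, starting from 1 it runs once, and starting from 0 it runs 65536 times with CX or 4294967296 times with ECX.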

Unlike rep movsb/stosb/etc., it doesn't check for ECX=0 before decrementing, only after¹.

The address-size determines whether it uses CX, ECX, or RCX. So in 64-bit code, addr32 loop is like dec ecx / jnz, while a regular loop is like dec rcx / jnz. Or in 16-bit code, it normally uses CX, but an address-size prefix (0x67) will make it use ecx. As Intel's manual says, it ignores REX.W, because that sets the operand-size, not the address-size.
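One way to picture the counter-width rule is a single step function parameterised by width. This is only a sketch: loop_step and its bits parameter are invented for the illustration, and *counter holds just the counter value, so nothing is claimed about what happens to the upper bits of the full register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: one execution of `loop` for a counter that is
   `bits` wide (16 for CX, 32 for ECX, 64 for RCX, as selected by the
   address size). Returns true when the branch is taken. */
bool loop_step(uint64_t *counter, unsigned bits)
{
    uint64_t mask = (bits == 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
    *counter = (*counter - 1) & mask;   /* decrement modulo 2^bits */
    return *counter != 0;               /* jump unless the counter reached 0 */
}
```

With a counter of 0, a 16-bit step leaves 0xFFFF, a 32-bit step leaves 0xFFFFFFFF, and a 64-bit step leaves all 64 bits set; in each case the branch is taken.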

rep string instructions use the address-size prefix the same way, overriding the address size but also RCX vs. ECX (or CX vs. ECX in modes other than 64-bit). The operand-size for string instructions is already used to determine movsw vs. movsd vs. movsq, and you want address/repeat size to be orthogonal to that. Having loop and jrcxz/jecxz follow that behaviour is just continuing the design intent from 8086 of loop being intended for use with string operations when a simple rep couldn't get the job done; see below.

Related: Why are loops always compiled into "do...while" style (tail jump)? for more about loop structure in asm, while() {} vs. do {} while() and how to lay them out.

Footnote 1: jcxz (or x86-64 jrcxz) was intended for use before the top of a do {} while style loop, to skip it if it should run 0 times. On modern CPUs test rcx, rcx / jz is more efficient.
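In C terms, the guard from the footnote turns the do {} while into something that can run zero times. A minimal sketch with a 16-bit counter follows; the function name is hypothetical, and the body is again assumed not to modify CX.

```c
#include <stdint.h>

/* Hypothetical helper: models a `jcxz` guard in front of a loop that
   ends with `loop`, using a 16-bit counter (CX). Returns the number of
   body executions, assuming the body never modifies CX. */
unsigned long guarded_loop_body_runs(uint16_t cx)
{
    unsigned long runs = 0;
    if (cx == 0)                    /* jcxz: skip the loop entirely */
        return runs;
    do {
        runs++;                     /* the loop body goes here */
        cx = (uint16_t)(cx - 1);
    } while (cx != 0);              /* loop looptop */
    return runs;
}
```

With the guard in place, a starting count of 0 gives 0 iterations instead of 65536, and every other starting value behaves as before.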

Stephen Morse, architect of 8086, wrote about the intended uses of loop/jcxz with string instructions in that section of his book, The 8086 Primer, available for free on his web site: https://www.stevemorse.org/8086/index.html. See the "complex string instructions" subsection, starting at the bottom of page 71. (Or start reading from earlier in the chapter, the whole String Instructions section starts on page 66. But note @ecm's review of a few things the book seems to explain poorly or incorrectly.)

If you're wondering about the design intent of x86 instructions, you won't find a better source than this. That's separate from the best / most efficient way to use them, especially on modern x86, but it's a very good intro for beginners into what you can do with asm instructions as building blocks.

Extra debugging tips

If you ever want to know the details on an instruction, check the manual: either Intel's official vol.2 PDF instruction set reference manual, or an html extract with each entry on a different page (http://felixcloutier.com/x86/). But note that the HTML leaves out the intro and appendices that have details on how to interpret stuff, like when it says "flags are set according to the result" for instructions like add.

And you can (and should) also just try stuff in a debugger: single-step and watch registers change. Use a smaller starting value for ecx so you get to the interesting ecx=1 part sooner. See also the x86 tag wiki for links to manuals, guides, and asm debugging tips at the bottom.

And BTW, if the instructions inside the loop that aren't shown modify ecx, it could loop any number of times. For the question to have a simple and unique answer, you need a guarantee that the instructions between the label and the loop instruction don't modify ecx. (They could save/restore it, but if you're going to do that it's usually better to just use a different register as the loop counter. push/pop inside a loop makes your code hard to read.)
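To make that concrete, here is a sketch of one hypothetical case: a body that itself decrements the counter once per iteration, modelled with a 16-bit CX. The function name and the cap parameter are inventions for this example; the cap only exists so the simulation returns even when the modelled loop would not reach zero within it.

```c
#include <stdint.h>

/* Hypothetical helper: a loop whose body also decrements the counter
   once per iteration, modelled with a 16-bit CX. `cap` bounds the
   simulation so that a non-terminating case still returns. */
unsigned long body_runs_when_body_decrements(uint16_t cx, unsigned long cap)
{
    unsigned long runs = 0;
    do {
        runs++;
        cx = (uint16_t)(cx - 1);    /* the body's own `dec cx` */
        cx = (uint16_t)(cx - 1);    /* the decrement done by `loop` */
    } while (cx != 0 && runs < cap);
    return runs;
}
```

Starting from 16, the body runs only 8 times. Starting from 15, the counter stays odd after every iteration, steps over zero, wraps, and never lands on 0, so the simulation stops only when it reaches the cap.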

Rant about over-use of LOOP even when you already need to increment something else in the loop. LOOP isn't the only way to loop, and usually it's the worst.

You should normally never use the loop instruction unless optimizing for code-size at the expense of speed, because it's slow. Compilers don't use it. (So CPU vendors don't bother to make it fast; catch 22.) Use dec / jnz, or an entirely different loop condition. (See also http://agner.org/optimize/ to learn more about what's efficient.)

Loops don't even have to use a counter; it's often just as good if not better to compare a pointer to an end address, or to check for some other condition. (Pointless use of loop is one of my pet peeves, especially when you already have something in another register that would work as a loop counter.) Using cx as a loop counter often just ties up one of your precious few registers when you could have used cmp/jcc on another register you were incrementing anyway.
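A counter-free loop of that kind might look like the following in C. It is a sketch with a hypothetical function name: the pointer that is being incremented anyway doubles as the loop condition, which is the cmp/jcc pattern on a register you already have.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums `len` elements by walking a pointer up to an end address.
   No separate counter is kept: the pointer being incremented is also
   the loop condition (cmp / jne in asm terms). */
int64_t sum_until_end(const int32_t *p, size_t len)
{
    const int32_t *end = p + len;
    int64_t total = 0;
    if (p == end)               /* skip the do-while when there is nothing to sum */
        return total;
    do {
        total += *p++;
    } while (p != end);
    return total;
}
```

The up-front p == end check plays the same role as a test / jz guard before a do {} while style loop.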

IMO, loop should be considered one of those obscure x86 instructions that beginners shouldn't be distracted with. Like stosd (without a rep prefix), aam or xlatb. It does have real uses when optimizing for code size, though. (That's sometimes useful in real life for machine code (like for boot sectors), not just for stuff like code golf.)

IMO, just teach / learn how conditional branches work, and how to make loops out of them. Then you won't get stuck into thinking there's something special about a loop that uses loop. I've seen an SO question or comment that said something like "I thought you had to declare loops", and didn't realize that loop was just an instruction.
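To illustrate that a loop is nothing more than a label plus a conditional branch, here is a C sketch written with goto so the shape matches dec / jnz. The function name is made up, and like loop itself it has no guard for a starting count of 0.

```c
/* Hypothetical helper: a counted loop written as a label plus a
   conditional branch, the shape a dec / jnz loop has in asm.
   Returns 1 + 2 + ... + n for n >= 1 (like `loop`, it does not guard
   against n == 0 before the first iteration). */
unsigned long sum_down_from(unsigned int n)
{
    unsigned long total = 0;
top:
    total += n;             /* loop body */
    n--;                    /* dec */
    if (n != 0)             /* jnz top */
        goto top;
    return total;
}
```

Nothing is declared as a loop here: the backward branch is the loop, and any other condition could replace the n != 0 test.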

</rant>. Like I said, loop is one of my pet peeves. It's an obscure code-golfing instruction, unless you're optimizing for an actual 8086.

I decided to post this just so we'd have a canonical answer to any future "how does `loop` work" questions. — Peter Cordes, Oct 23 '17 at 03:50
@ineedahero: feel free to stop reading after the first sentence or paragraph, then. That exactly describes its normal operation. — Peter Cordes, Mar 18 '18 at 16:54
Ha! If only "too much detail" was a common problem. Thanks @PeterCordes, I found this answer very useful and educational. I would note though that I landed here after I found loope in the instructions emitted by the .NET runtime, so I don't think it's the case that compilers don't use it. Also, if it's being emitted by the .NET runtime, it must be reasonably fast given how much time that group spends profiling and optimizing. — N8allan, Jul 06 '19 at 23:20
@N8allan: Was .NET tuning for an AMD CPU in that case? I'm curious what context, because `loop` and `loope` are still slow on Intel CPUs. (See [Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?](//stackoverflow.com/q/35742570)). Is it possible you started disassembling from something that wasn't supposed to be the start of an instruction, so your disassembly was out of sync for a while? Or are you sure it was actually running? — Peter Cordes, Jul 06 '19 at 23:33
@PeterCordes I'm running a recent generation Xeon. It's certainly possible that the instructions were out of sync but I was only looking up a few lines from a debug point I was at and I don't think I was looking past the beginning of the routine. It was a release build, so I didn't have much debugger support and I'm definitely not sure it was running. Maybe I'll get curious enough to go back to it... :-) — N8allan, Jul 07 '19 at 02:39
Because of the small instruction size and the inherent slowness of Real Mode, `loop` is still useful for initializing large tables (i.e. pages) in MBR sector 0. — yyny, Aug 03 '20 at 15:26
@yyny: right, like I said in my answer, "unless optimizing for code-size at the expense of speed". Like in a bootloader, or for [codegolf.stackexchange.com](https://codegolf.stackexchange.com/q/132981) — Peter Cordes, Aug 03 '20 at 15:46
"it doesn't check for ECX=0 before decrementing, only after." Like I did in my pseudo code examples for [repeated string ops like `movsb` with a `rep` prefix](https://pushbx.org/ecm/doc/insref.htm#insMOVSB) you could list that one can add an explicit check for `(r/e)cx` zero before the first iteration, using the `jcxz` type branch instructions (or just regular `test cx, cx` \ `jz`). — ecm, Sep 28 '21 at 11:33
@ecm: Thanks, yeah added a footnote about that. And cited Stephen Morse's The 8086 Primer. — Peter Cordes, Sep 28 '21 at 14:53
@Peter Cordes: Interesting read. However, it leaves to implication that ZF is unmodified if `rep(n)e scas/cmps` is used with `cx` initially equal to zero. And the branch's relative displacement is incorrectly stated as being the distance from "the offset of the `JCXZ` instruction" (instead of actually behind it). And it states that `loop` is unconditional; plus, the flow charts say that `loop` ought to loop back to the `jcxz`. Both not exactly wrong but misleading. (It does branch "unconditionally" so to say - if after `jcxz` fell through. Looping back to `jcxz` works, it's just not optimal.) — ecm, Sep 28 '21 at 16:14
@ecm: Interesting, I hadn't read it in enough detail to notice those issues. — Peter Cordes, Sep 28 '21 at 16:17
@Peter Cordes: That section from the primer does help me understand why `cmps` is `cmp [si], [es:di]` though. That is, technically the "Source Index" pointer is used as the "destination" of the comparison subtraction. If you simulate the workings of `cmps` using the accumulator then it is like `lods` \ `scas`, or `mov A, [si]` \ `cmp A, [es:di]`. — ecm, Sep 28 '21 at 16:18
