Trouble with cmpsb in x86 Assembly

Question

I'm having a bit of trouble understanding what this assembly code is doing. The function takes in an integer argument and I am supposed to figure out what arguments will allow me to take the jump at the end of this block. However, I am completely lost on what the cmpsb is doing. I don't know what %es and %ds are or what is being compared (the argument and 0x8048bb0 as strings?).

lea    0x1e(%esp),%edi    //loads the integer argument into edi register
mov    $0x8048bb0,%esi    //moves 0x8048bb0 into esi register
mov    $0xd,%ecx          //moves 13 into ecx register
repz cmpsb %es:(%edi),%ds:(%esi)  //a loop and a string byte comparison?
seta   %dl                //sets dl if previous comparison results in >
setb   %al                //sets al if previous comparison results in <
cmp    %al,%dl            //compares al and dl
je     8048851 <phase_3_of_5+0x69>  //jumps if al and dl are equal (meaning the above comparison was equal)

I've tried searching for how cmpsb works but I can't find anything that has it laid out like this.

Furthermore, what's the point of the seta and setb? If it's just checking if the cmpsb was equal why couldn't the je just come after that?

[First Web search hit for cmpsb](http://faydoc.tripod.com/cpu/cmpsb.htm) explains. — Raymond Chen, Feb 24 '16 at 06:41
I've already read that but I don't understand it. What does it mean by "byte at address DS:(E)SI"? I mentioned that I don't know what `%es` and `%ds` are and I also don't know what the `:` does here. That also doesn't help me understand what is actually being compared here. Why is it doing a string comparison on an integer and a hex value? — hbgoddard, Feb 24 '16 at 06:51
`es` and `ds` are segment registers. But you can ignore those here, as they just look like a byproduct of some verbose disassembler. The instruction is comparing the byte pointed to by register `edi` with the byte pointed to by register `esi`. It will keep doing so as long as the two bytes are equal (because of `repz`), and after each comparison `esi` and `edi` will either be incremented or decremented depending on the value of the direction flag. — Michael, Feb 24 '16 at 07:10
@Michael Even though the default segment registers wouldn't be needed for the command, they can't just be ignored if the code is to be understood. And the DS can be even overridden. So the instruction doesn't compare byte pointed to by register EDI, it uses ES:EDI. REPZ also stops when ECX gets to zero, if another condition is not met before. It will not keep going as long as the bytes are equal. — Sami Kuhmonen, Feb 24 '16 at 08:11

score 2 · Accepted Answer · answered Feb 24 '16 at 08:24

lea    0x1e(%esp),%edi    //loads the address ESP+0x1e (a buffer on the stack) into EDI
mov    $0x8048bb0,%esi    //loads the address of another string into ESI
mov    $0xd,%ecx          //moves 13 into ECX: compare at most 13 bytes
repz cmpsb %es:(%edi),%ds:(%esi)  //compares the bytes at DS:ESI and ES:EDI one pair at a time, until ECX is zero or a mismatching pair is found
seta   %dl                //sets DL to 1 if the last comparison was "above" (unsigned >), else 0
setb   %al                //sets AL to 1 if the last comparison was "below" (unsigned <), else 0
cmp    %al,%dl            //compares AL and DL
je     8048851 <phase_3_of_5+0x69>  //jumps if AL and DL are equal (both 0, meaning all compared bytes were equal)

ES and DS are segment registers, and the : separates the segment register from the offset. In the 16-bit world they were simple to use: the address DS:SI is DS*16 + SI. In the 32/64-bit segmented world they are more complicated: each one holds a selector that indexes a descriptor table, and the final address is derived from the segment base stored there. The main point to understand is that every memory access uses both a segment register and an offset. The segment register may be hidden from the assembly code, but it's always there.
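To make the 16-bit formula concrete, here is a minimal C sketch of the real-mode translation. The function name `real_mode_linear` is made up for this illustration, and the sketch covers only the simple 16-bit case, not the descriptor-table lookup used in protected mode.

```c
#include <stdint.h>

/* Real-mode (16-bit) address translation: linear = segment * 16 + offset.
   Hypothetical helper name; not part of any library. */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    /* Shifting left by 4 is the same as multiplying by 16. */
    return ((uint32_t)segment << 4) + offset;
}
```

For instance, `real_mode_linear(0x1234, 0x0010)` and `real_mode_linear(0x1235, 0x0000)` both give 0x12350, which shows that different segment:offset pairs can name the same byte, and that the same offset under different segments names different bytes.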

The segment registers also mean that even when ESI equals EDI, DS:ESI and ES:EDI may refer to different memory locations if DS and ES differ. Likewise, DS:ESI and ES:ESI can point to different locations.

The code doesn't compare a string and an integer. ESI and EDI each contain an integer value, but those values are used as pointers to memory locations, and the bytes at those locations are compared with each other by the CMPSB instruction. REPZ repeats the comparison until a pair of bytes differs or ECX becomes zero.
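Seen from C, the effect is roughly a 13-byte `memcmp` on two pointers. The sketch below is only an approximation under the assumption of a flat memory model (DS and ES describing the same address space). The helper name and its parameter names are invented here, with `edi` and `esi` standing in for the register values.

```c
#include <string.h>

/* Hypothetical helper: returns 1 when the 13 bytes at the two addresses
   match, which is the condition under which the je in the listing is taken.
   The pointer values themselves are never compared, only the bytes. */
int thirteen_bytes_equal(const void *edi, const void *esi)
{
    return memcmp(edi, esi, 13) == 0;
}
```

What the 13 bytes at 0x8048bb0 actually are cannot be read off the listing; you would have to inspect that address in a debugger.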

The SETA/SETB instructions are not needed for this jump unless their values are used later on. The code could just check whether the zero flag is set, which means every comparison was equal, so all 13 bytes pointed to by ES:EDI and DS:ESI matched.
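To see why the SETA/SETB/CMP sequence boils down to a zero-flag test, here is a small C model of the three instructions. It assumes CF and ZF are the flags left by the last CMPSB, each given as 0 or 1. The function name is made up for this sketch.

```c
/* Models: seta %dl ; setb %al ; cmp %al,%dl ; je ...
   cf and zf are the carry and zero flags after the last cmpsb (0 or 1). */
int je_taken(int cf, int zf)
{
    int dl = (!cf && !zf);  /* seta: unsigned "above" means CF=0 and ZF=0 */
    int al = cf;            /* setb: unsigned "below" means CF=1 */
    return dl == al;        /* cmp %al,%dl followed by je */
}
```

DL and AL can never both be 1, so they are equal only when both are 0, and that happens only when the last comparison found equal bytes (ZF=1).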

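The code shown does not set the direction flag, so from the listing alone it's not clear whether the bytes are compared from ES:EDI onwards or backwards. Going forwards is the logical assumption, and the usual 32-bit calling conventions expect the direction flag to be clear on function entry.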

score 1 · Answer 2 · edited May 23 '17 at 12:32

Note: this answer talks about the x86 32-bit architecture (80386DX and higher). While the 16-bit architecture (8086 - 80286) is similar, it is inherently different nevertheless. Read the Intel 64 and IA-32 Architectures Software Developer's Manual for further information.

Furthermore, I'm using Intel syntax here. If AT&T syntax, as used in your question, is more familiar to you, tell me and I'll adjust my answer accordingly.

x86 processors have a certain set of registers.
From vol. 1, §3.4:

General-purpose registers. These eight registers are available for storing operands and pointers.

Segment registers. These registers hold up to six segment selectors.

EFLAGS (program status and control) register. The EFLAGS register report on the status of the program being executed and allows limited (application-program level) control of the processor.

EIP (instruction pointer) register. The EIP register contains a 32-bit pointer to the next instruction to be executed.

(Code formatting added.)

Segment registers contain a segment selector, which points to a segment descriptor in the Global Descriptor Table, which in turn describes a segment of linear (a.k.a. virtual)¹ memory. It's complex and lengthy, so I won't delve into great detail about it here. Read the manual if you want to know more.
The colon (:) here is just a notation for the segment-offset combination.
Moreover, you don't need to fret about segmentation in a user program because it is completely handled by the OS and the value usually stays the same throughout the program's runtime.

Now that you roughly know what segment registers are, I'll explain the instruction itself.
From vol. 1, §7.3.9.1:

[...]

The string elements to be operated on are identified with the ESI (source string element) and EDI (destination string element) registers. Both of these registers contain absolute addresses (offsets into a segment) that point to a string element.

By default, the ESI register addresses the segment identified with the DS segment register. A segment-override prefix allows the ESI register to be associated with the CS, SS, ES, FS, or GS segment register. The EDI register addresses the segment identified with the ES segment register; no segment override is allowed for the EDI register. The use of two different segment registers in the string instructions permits operations to be performed on strings located in different segments.

[...]

The CMPS instruction subtracts the destination string element from the source string element and updates the status flags (CF, ZF, OF, SF, PF, and AF) in the EFLAGS register according to the results. Neither string element is written back to memory. The assembler recognizes three “short forms” of the CMPS instruction: CMPSB (compare byte strings), CMPSW (compare word strings), and CMPSD (compare doubleword strings).

(Code formatting added.)
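As a rough illustration of the quoted description, the following C sketch computes the flags most relevant here for a single byte comparison, using "source minus destination" as the manual states. The struct and function names are my own, and PF and AF are left out for brevity.

```c
#include <stdint.h>

/* Subset of the status flags updated by CMPS (PF and AF omitted). */
struct flags { int cf, zf, sf, of; };

/* Hypothetical model of one CMPSB step: computes src - dst on 8 bits
   and derives the flags; neither operand is modified. */
struct flags cmps_byte_flags(uint8_t src, uint8_t dst)
{
    uint8_t diff = (uint8_t)(src - dst);
    struct flags f;
    f.cf = src < dst;              /* unsigned borrow */
    f.zf = diff == 0;              /* operands equal */
    f.sf = (diff & 0x80) != 0;     /* sign bit of the result */
    /* Signed overflow: operands differ in sign and the result's sign
       differs from the source's sign. */
    f.of = (((src ^ dst) & (src ^ diff)) & 0x80) != 0;
    return f;
}
```

SETA and SETB in the question's listing only look at CF and ZF, which is why they express an unsigned byte comparison.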

Long story short: CMPS performs a CMP with DS:ESI and ES:EDI as operands. Interesting to note is that CMP alone cannot compare two memory operands. However, CMPS can.
Some instructions assume registers implicitly. The string instructions fall into that category. They automatically work on ESI and EDI, and only a segment override prefix for the source is allowed (so it's not DS:ESI but FS:ESI, for instance). Another example of implicit operands is SHR. SHR AX shifts AX one bit to the right. In that case, however, the rationale lies in history: the first x86 CPUs could only shift by one bit or by CL bits. Immediate shift counts were introduced later, so back then SHR AX was used to shift by one bit, equivalent to SHR AX, 1.
But why does the disassembler (presumably GNU objdump or gdb) print the source and destination operands anyway? Good question; I can't tell for sure either. Maybe to display possible segment override prefixes.

Let's talk about the prefix REPZ now.
From vol. 1, §7.3.9.2:

The following repeat prefixes can be used in conjunction with a count in the ECX register to cause a string instruction [hyphen removed] to repeat:

REP — Repeat while the ECX register not zero.

REPE/REPZ — Repeat while the ECX register not zero and the ZF flag is set.

REPNE/REPNZ — Repeat while the ECX register not zero and the ZF flag is clear.

(Code formatting added.)

So the REPZ CMPSB instruction repeats CMPSB as long as ECX is not zero and ZF (the zero flag) is set.
From vol. 1, §3.4.3.1:

Zero flag — Set if the result is zero; cleared otherwise.

From this, and because a zero result indicates equality, we can deduce that REPZ CMPSB keeps running as long as BYTE PTR [DS:ESI] equals BYTE PTR [ES:EDI], at most ECX times. Because ESI and EDI are advanced after every comparison, when the instruction has finished they point either just past the first unequal BYTE PTR [DS:ESI]-BYTE PTR [ES:EDI] pair or just past the last bytes compared (in case ECX has reached zero).
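The loop can be sketched in C as follows. This is a simplified model, not the exact microarchitectural behaviour: it assumes the direction flag is clear (pointers move forward), treats DS and ES as the same flat address space, and tracks only ZF and CF. The struct and function names are invented for the example.

```c
#include <stdint.h>

/* Register state relevant to REPZ CMPSB in this simplified model. */
struct cmps_state {
    const uint8_t *esi;  /* source pointer (DS:ESI) */
    const uint8_t *edi;  /* destination pointer (ES:EDI) */
    uint32_t ecx;        /* remaining repeat count */
    int zf;              /* zero flag; left untouched if ecx starts at 0 */
    int cf;              /* carry flag; left untouched if ecx starts at 0 */
};

/* Hypothetical emulation of "repz cmpsb" with DF = 0. */
void repz_cmpsb(struct cmps_state *s)
{
    while (s->ecx != 0) {
        uint8_t src = *s->esi++;   /* read, then advance ESI */
        uint8_t dst = *s->edi++;   /* read, then advance EDI */
        s->zf = (src == dst);
        s->cf = (src < dst);       /* borrow from src - dst */
        s->ecx--;
        if (!s->zf)                /* REPZ stops once ZF is cleared */
            break;
    }
}
```

Note that the pointers are advanced before the ZF check, so after a mismatch ESI and EDI already point one byte past the differing pair, and ECX holds the number of bytes that were not compared.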

To be continued with SETA and SETB instructions soon.

All quotations refer to the Intel 64 and IA-32 Architectures Software Developer's Manual.

¹ For the difference between physical, logical, and virtual addresses, see here.
