string comparison in 8086

Question

I have problem with this question. I don't know what it wants from me.

Question : Write a procedure that compares a source string at DS:SI to a destination string at ES:DI and sets the flags accordingly. If the source is less than the destination, carry flag is set. if string are equal , the zero flag is set. if the source is greater than the destination , the zero and carry flags are both cleared.

My Answer :

MOV ESI , STRING1
MOV EDI, STRING2
MOV ECX, COUNT
CLD
REPE CMPSB

Still I am not sure about it. Is it true or should I try something else ?

p.s: I don't understand why people vote down this question. What is wrong with my question ? I think we are all here for learning. Or not ? Miss I something ?

I don't know either, but don't you just have the means to try and run it at the ready? If not, such means would be worth pursuing - provided that you will have more questions on the topic. — Gassa, Nov 08 '15 at 09:52
Well, how are the zero and carry flags for different strings after *running* the code above? Are they right? Then you're done. Are they wrong but still have the information for all 3 differnt outcomes? Then you just need to add some more code to fix the flags to match the requirement. Are you not having enough information, so you can't detect all 3 different outcomes? Then you need to change the code completely or add more code to differentiate all the 3 possibilities and *then* make it so that the flags match the reqs. — hyde, Nov 08 '15 at 09:53
8086 is 16 bit .... so you do not have 32 bit registers. And instead of `cld` I would chose `cmp al,al` to clear the Carry and set Zero to handle case when `COUNT=0` — Spektre, Nov 08 '15 at 09:57
Oh, also "write a procedure", this probably means you have to write a callable subroutine which returns the result in flags? If that is the case (you should know based on your course material), you should include your complete procedure in the question, including entry/label and return... (Of course this can get problematic, if you post your answers publicly in the internet, and others in your course copy it... Homework problems at SO can be problematic like this.) — hyde, Nov 08 '15 at 09:57
@hyde +1 and just need to add there may or may not be also some calling convention taking operands from stack ... — Spektre, Nov 08 '15 at 10:00
@Spektre Well, the terminology is a bit blurry I think. 8086 can be used to mean "real mode", where you can use 32 bit registers (assuming you have at least 80386 processor). But x86 is a better tag. — hyde, Nov 08 '15 at 10:05
@hyde, actually it is not homework, it is a question of past year exam and I am studying for my exam now. I wrote the question completely. There is no label or something else. — earlymorningtea, Nov 08 '15 at 10:15
@Gassa, my problem is should I write anything else more, because I know CMPS instruction set flags, why should I set them again. If I compare these strings with CMPS , the flags will be set already. — earlymorningtea, Nov 08 '15 at 10:17
@hyde: Some asm courses teach asm with emu8086, so they have to learn segmented memory, and the limited addressing mode selection where the registers are non-orthogonal, and there's no movzx / movsx. Questions where you're limited to actual 8086 should be tagged x86 and 8086. Questions about 16bit DOS (`int 21h`) code should be tagged x86 and dos. Questions about normal asm that runs as (part of) a 32 or 64 bit process should be tagged just x86. — Peter Cordes, Nov 08 '15 at 17:29
@tea-addict: so use a debugger and look at the flags after running this, obviously. Or write code that does a `setcc` or conditional branch based on flags, like an actual caller would do after a call to your function. Also, if the problem statement says the pointers are already in `SI` and `DI` when you're called, you shouldn't clobber them. 32bit x86 usually uses a stack calling convention, but passing args in registers works fine, and that's how both the Windows and Linux/Mac x86-64 ABIs work. — Peter Cordes, Nov 08 '15 at 17:32
@PeterCordes thank you , I understand now which way I should go to solution :) — earlymorningtea, Nov 09 '15 at 19:14

Peter Cordes · Accepted Answer · 2019-03-15T01:57:11.507

If the problem statement says the pointers are already in SI and DI when you're called, you shouldn't clobber them.

16-bit code often doesn't stick to a single calling convention for all functions, and passing (the first few) args in registers is usually good (fewer instructions, and avoids store/reload). 32-bit x86 calling conventions usually use stack args, but that's obsolete. Both the Windows x64 and Linux/Mac x86-64 System V ABIs / calling conventions use register args.

The problem statement doesn't mention a count, though. So you're implementing strcmp for strings terminated by a zero-byte, rather than memcmp for known-length blocks of memory. You can't use a single rep instruction, since you need to check for non-equal and for end-of-string. If you just pass some large size, and the strings are equal, repe cmpsb would continue past the terminator.

repe cmpsb is usable if you know the length of either string. e.g. take a length arg in CX to avoid the problem of running past the terminator in both strings.

But for performance, repe cmpsb isn't fast anyway (like 2 to 3 cycles per compare, on Skylake vs. Ryzen. Or even 4 cycles per compare on Bulldozer-family). Only rep movs and rep stos are efficient on modern CPUs, with optimized microcode that copies or stores 16 (or 32 or 64) bytes at a time.

There are 2 major conventions for storing strings in memory: Explicit-length strings (pointer + length) like C++ std::string, and implicit length strings where you just have a pointer, and the end of string is marked by a sentinel / terminator. (Like C char* that uses a 0 byte, or DOS string-print functions that use '$' as a terminator.)

A useful observation is that you only need to check for the terminator in one of the strings. If the other string has a terminator and this one doesn't, it will be a mismatch.

So you want to load a byte into a register from one string, and check it for the teminator and against memory for the other string.

(If you need to actually use ES:DI instead of just DI with the default DS segment base, you can use cmp al, [es: bx + di] (NASM syntax, adjust as needed like maybe cmp al, es: [bx + di] I think). Probably the question intended for you to use lodsb and scasb, because scasb uses ES:DI.)

;; inputs:  str1 pointer in DI, str2 pointer in SI
;; outputs: BX = mismatch index, or one-past-the-terminators.
;;          FLAGS:  ZF=1 for equal strings (je),  ZF=0 for mismatch (jne)
;; clobbers: AL (holds str1's terminator or mismatching byte on return)
strcmp:
    xor    bx, bx
.innerloop:                 ; do { 
    mov    al, [si + bx]        ; load a source byte
    cmp    al, [di + bx]        ; check it against the other string
    jne   .mismatch             ; if (str1[i] != str2[i]) break;

    inc    bx                   ; index++

    test   al, al               ; check for 0.  Use cmp al, '$' for a $ terminator
    jnz   .innerloop        ; }while(str1[i] != terminator);

    ; fall through (ZF=1) means we found the terminator
    ; in str1 *and* str2 at the same position, thus they match

.mismatch:               ; we jump here with ZF=0 on mismatch
    ; sete  al           ; optionally create an integer in AL from FLAGS
    ret

Usage: put pointers in SI/DI, call strcmp / je match, because the match / non-match state is in FLAGS. If you want to turn the condition into an integer, 386 and later CPUs allow sete al to create a 0 or 1 in AL according to the equals condition (ZF==1).

Using sub al, [mem] instead of cmp al, [mem], we'd get al = str1[i] - str2[i], giving us a 0 only if the strings matched. If your strings only hold ASCII values from 0..127, that can't cause signed overflow, so you can use it as a signed return value that actually tells you which string sorts before/after the other. (But if there could be high-ASCII 128..255 bytes in the string, we'd need to zero- or sign-extend to 16-bit first to avoid signed overflow for a case like (unsigned)5 - (unsigned)254 = (signed)+7 because of 8-bit wraparound.

Of course, with our FLAGS return value, the caller can already use ja or jb (unsigned compare results), or jg / jl if they want to treat the strings as holding signed char. This works regardless of the range of input bytes.

Or inline this loop so jne mismatch jumps somewhere useful directly.

16-bit addressing modes are limited, but BX can be a base, and SI and DI can both be indices. I used an index increment instead of inc si and inc di. Using lodsb would also be an option, and maybe even scasb to compare it to the other string. (And then check the terminator.)

Performance

Indexed addressing modes can be slower on some modern x86 CPUs, but this does save instructions in the loop (so it's good for true 8086 where code-size matters). Although to really tune for 8086, I think lodsb / scasb would be your best bet, replacing the mov load and cmp al, [mem], and also the inc bx. Just remember to use cld outside the loop if your calling convention doesn't guarantee that.

If you care about modern x86, use movzx eax, byte [si+bx] to break the false dependency on the old value of EAX, for CPUs that don't rename partial registers separately. (Breaking the false dep is especially important if you use sub al, [str2] because that would turn it into a 2-cycle loop-carried dependency chain through EAX, on CPUs other than PPro through Sandybridge. IvyBridge and later doesn't rename AL separately from EAX, so mov al, [mem] is a micro-fused load+merge uop.)

If the cmp al,[bx+di] micro-fuses the load, and macro-fuses with jne into one compare-and-branch uop, the whole loop could be only 4 uops total on Haswell, and could run at 1 iteration per clock for large inputs. The branch mispredict at the end will make small-input performance worse, unless branching goes the way way every time for a small enough input. See https://agner.org/optimize/. Recent Intel and AMD can do 2 loads per clock.

Unrolling could amortize the inc bx cost, but that's all. With a taken + not-taken branch inside the loop, no current CPUs can run this faster than 1 cycle per iteration. (See Why are loops always compiled into "do...while" style (tail jump)? for more about the do{}while loop structure). To go faster we'd need to check multiple bytes at once.

Even 1 byte / cycle is very slow compared to 16 bytes per 1 or 2 cycles with SSE2 (using some clever tricks to avoid reading memory that might fault).

See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for more about using x86 SIMD for string compare, and also glibc's SSE2 and later optimized string functions.

GNU libc's fallback scalar strcmp implementation looks decent (translated from AT&T to Intel syntax, but with the C preprocessor macros and stuff left in. L() makes a local label).

It only uses this when SSE2 or better isn't available. There are bithacks for checking a whole 32-bit register for any zero bytes, which could let you go faster even without SIMD, but alignment is a problem. (If the terminator could be anywhere, you have to be careful when loading multiple bytes at once to not read from any memory pages that you aren't sure contain at least 1 byte of valid data, otherwise you could fault.)

strcmp:
         mov    ecx,DWORD PTR [esp+0x4]
         mov    edx,DWORD PTR [esp+0x8]     # load pointer args

L(oop): mov     al,BYTE PTR [ecx]          # movzx eax, byte ptr [ecx] would be avoid a false dep
        cmp     al,BYTE PTR [edx]
        jne     L(neq)
        inc     ecx
        inc     edx
        test    al, al
        jnz     L(oop)

        xorl    eax, eax
        /* when strings are equal, pointers rest one beyond
           the end of the NUL terminators.  */
        ret

L(neq): mov    eax,  1
        mov    ecx, -1
        cmovb  eax, ecx
        ret

string comparison in 8086

1 Answers1

Performance

Linked