3

I'm trying to run a binary program that uses CMPXCHG16B instruction at one place, unfortunately my Athlon 64 X2 3800+ doesn't support it. Which is great, because I see it as a programming challenge. The instruction doesn't seem to be that hard to implement with a cave jump, so that's what I did, but something didn't work, program just froze in a loop. Maybe someone can tell me if I implemented my CMPXCHG16B wrong?

Firstly the actual piece of machine code that I'm trying to emulate is this:

f0 49 0f c7 08                lock cmpxchg16b OWORD PTR [r8]

Excerpt from Intel manual describing CMPXCHG16B:

Compare RDX:RAX with m128. If equal, set ZF and load RCX:RBX into m128. Else, clear ZF and load m128 into RDX:RAX.

First I replace all 5 bytes of the instruction with a jump to code cave with my emulation procedure, luckily the jump takes up exactly 5 bytes! The jump is actually a call instruction e8, but could be a jmp e9, both work.

e8 96 fb ff ff            call 0xfffffb96(-649)

This is a relative jump with a 32-bit signed offset encoded in two's complement, the offset points to a code cave relative to address of next instruction.

Next the emulation code I'm jumping to:

PUSH R10
PUSH R11
MOV r10, QWORD PTR [r8]
MOV r11, QWORD PTR [r8+8]
TEST R10, RAX
JNE ELSE
TEST R11, RDX
JNE ELSE
MOV QWORD PTR [r8], RBX
MOV QWORD PTR [r8+8], RCX
JMP END
ELSE:
MOV RAX, r10
MOV RDX, r11
END:
POP R11
POP R10
RET

Personally, I'm happy with it, and I think it matches the functional specification given in manual. It restores stack and two registers r10 and r11 to their original order and then resumes execution. Alas it does not work! That is the code works, but the program acts as if it's waiting for a tip and burning electricity. Which indicates my emulation was not perfect and I inadvertently broke it's loop. Do you see anything wrong with it?

I notice that this is an atomic variant of it—owning to the lock prefix. I'm hoping it's something else besides contention that I did wrong. Or is there a way to emulate atomicity too?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
vlsh
  • 175
  • 8
  • Well... you can't really gloss over contention. Also `test` is not English "*test*" it is a non destructive AND. You may consider using `cmp`. – Margaret Bloom Jul 06 '16 at 08:54
  • 2
    I somehow don't see how you will deal with atomicity of the original `lock cmpxchg16b` in MT environment with your emulation? (although I'm not sure if cmpxchg16b is atomic, even with `lock` prefix used, it's somewhat complicated around SSE and I never studied it in depth). (edit: duh, you stated it too, I'm blind today) – Ped7g Jul 06 '16 at 11:57
  • @PeterCordes are you sure? What about MFENCE? Seems to be what I need. – vlsh Jul 06 '16 at 21:57
  • (OP is replying to a comment I deleted after posting an answer) – Peter Cordes Jul 07 '16 at 19:24
  • You could write but not debug this? – Alec Teal Oct 21 '19 at 01:45

2 Answers2

6

It's not possible to emulate lock cmpxchg16b. It's sort of possible if all accesses to the target address are synchronised with a separate lock, but that includes all other instructions, including non-atomic stores to either half of the object, and atomic read-modify-writes (like xchg, lock cmpxchg, lock add, lock xadd) with one half (or other part) of the 16 byte object.

You can emulate cmpxchg16b (without lock) like you've done here, with the bugfixes from @Fifoernik's answer. That's an interesting learning exercise, but not very useful in practice, because real code that uses cmpxchg16b always uses it with a lock prefix.

A non-atomic replacement will work most of the time, because it's rare for a cache-line invalidate from another core to arrive in the small time window between two nearby instructions. This doesn't mean it's safe, it just means it's really hard to debug when it does occasionally fail. If you just want to get a game working for your own use, and can accept occasional lockups / errors, this might be useful. For anything where correctness is important, you're out of luck.


What about MFENCE? Seems to be what I need.

MFENCE before, after, or between the loads and stores won't prevent another thread from seeing a half-written value ("tearing"), or from modifying the data after your code has made the decision that the compare succeeded, but before it does the store. It might narrow the window of vulnerability, but it can't close it, because MFENCE only prevents reordering of the global visibility of our own stores and loads. It can't stop a store from another core from becoming visible to us after our loads but before our stores. That requires an atomic read-modify-write bus cycle, which is what locked instructions are for.

Doing two 8-byte atomic compare-exchanges would solve the window-of-vulnerability problem, but only for each half separately, leaving the "tearing" problem.

Atomic 16B loads/stores solves the tearing problem but not the atomicity problem between loads and stores. It's possible with SSE on some hardware, but not guaranteed to be atomic by the x86 ISA the way 8B naturally-aligned loads and stores are.


Xen's lock cmpxchg16b emulation:

The Xen virtual machine has an x86 emulator, I guess for the case where a VM starts on one machine and migrates to less-capable hardware. It emulates lock cmpxchg16b by taking a global lock, because there's no other way. If there was a way to emulate it "properly", I'm sure Xen would do that.

As discussed in this mailing list thread, Xen's solution still doesn't work when the emulated version on one core is accessing the same memory as the non-emulated instruction on another core. (The native version doesn't respect the global lock).

See also this patch on the Xen mailing list that changes the lock cmpxchg8b emulation to support both lock cmpxchg8b and lock cmpxchg16b.

I also found that KVM's x86 emulator doesn't support cmpxchg16b either, according to the search results for emulate cmpxchg16b.

I think all this is good evidence that my analysis is correct, and that it's not possible to emulate it safely.

Community
  • 1
  • 1
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • After a wall of evidence, I'm convinced. MFENCE was the trickiest to understand. I didn't try SSE yet, those registers are really awkward to use for general purpose, it's a whole art form. I finished patching the rest of the application, it almost loads, but then blows up after rendering a single frame. Sometimes I get to see that frame. It's not a game, it's NASA's Eyes, I wanted to see Juno simulation. – vlsh Jul 09 '16 at 02:29
  • @vlsh: It might be possible to use MFENCE and/or SSE to make it work more often, especially if you're only interested in one specific use in one program. The other option is to just use a spinlock, if you can hook all the stores to the 16B object. (It would maybe help even if you missed some). But anyway, it sounds like a solution that only works most of the time but isn't totally safe would be at least interesting for you. I'm definitely sure that a fully safe emulation isn't possible, but you can maybe do better. Maybe even use `CPUID` to serialize the pipeline ahead of the sequence? – Peter Cordes Jul 09 '16 at 03:29
4

I see these things wrong with your code to emulate the cmpxchg16b instruction:

  • You need to use cmp in stead of test to get a correct comparison.

  • You need to save/restore all flags except the ZF. The manual mentions :

    The CF, PF, AF, SF, and OF flags are unaffected.


The manual contains the following:

IF (64-Bit Mode and OperandSize = 64)
    THEN
         TEMP128 ← DEST
         IF (RDX:RAX = TEMP128)
              THEN
                    ZF ← 1;
                    DEST ← RCX:RBX;
              ELSE
                    ZF ← 0;
                    RDX:RAX ← TEMP128;
                    DEST ← TEMP128;
                    FI;
         FI

So to really write code that "matches the functional specification given in manual" a write to the m128 is required. Although this particular write is part of the locked version lock cmpxchg16b, it won't of course do any good to the atomicity of the emulation! A straightforward emulation of lock cmpxchg16b is thus not possible. See @PeterCordes' answer.

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison

ELSE:
MOV RAX, r10
MOV RDX, r11
MOV QWORD PTR [r8], r10
MOV QWORD PTR [r8+8], r11
END:
Community
  • 1
  • 1
Fifoernik
  • 9,779
  • 1
  • 21
  • 27
  • What do you know, it was the TEST! Replaced it with CMP, loop finished, correctly or not. I have a few more instances of instruction to replace before program launches, then will know for sure... About flags, in practice it was only the *parity flag* not being restored. – vlsh Jul 06 '16 at 14:38
  • @vlsh and Fifoernik: This still doesn't make it atomic. Unconditionally doing the stores doesn't help anything, either. There's no need to emulate that quirk of the instruction. But anyway, even using SSE to load or store 128b in a single instruction isn't guaranteed to be atomic. (And might actually not be on that CPU). See also http://stackoverflow.com/questions/7646018/sse-instructions-which-cpus-can-do-atomic-16b-memory-operations – Peter Cordes Jul 06 '16 at 16:42
  • I'd upvote this if you edit it to make it clear that you're just emulating `cmpxchg16b`, *not* `lock cmpxchg16b`. I still say the whole second half about emulating the unconditional store is pointless, too. I don't think dirtying the line in the cache can produce any observable results, and an extra non-atomic read-rewrite is NOT part of the operation of `cmpxchg16b` without a `lock` anyway. (That quote is from a paragraph describing the LOCKed version only). – Peter Cordes Jul 07 '16 at 19:22
  • Hmm, I think I misread it before. The operation section does show the write as always happening, even in the non-locked case. Now I understand why they bother to mention that in the cmpxchg / cmpxchg8b instruction descriptions. – Peter Cordes Jul 09 '16 at 16:07