Silly idea #1: Lookup table. That can't work in 16-bit real mode. Even a whole 64kiB segment for a table isn't enough; we need twice that to be able to look up a 2-byte result for any possible 16-bit value.
We could do this with 32-bit addressing easily, like xor ebx, ebx / mov bx, ax / mov bx, [table + ebx*2], if you can justify 128kiB of table data. :P
Fully within the rules, you could construct the table on the stack in 32-bit or 64-bit mode with sub esp, 1<<17 and store the data with mov word [esp+0], 0 / mov word [esp + 2], 1 / etc. Fully unrolled, no looping, so (at roughly 10 bytes per mov word [esp + disp32], imm16 store) several hundred kiB of machine code. But again, this doesn't work in real mode and is a total joke for efficiency.
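For reference, here's what the table approach looks like in C (a minimal sketch; the build_table / abs16_table names are mine, and it builds the table with a loop instead of unrolled stores):

#include <stdint.h>

static uint16_t table[1u << 16];   /* 65536 entries * 2 bytes = 128kiB */

void build_table(void) {
    for (uint32_t i = 0; i < (1u << 16); i++) {
        int16_t v = (int16_t)i;                /* reinterpret the bit pattern */
        table[i] = (uint16_t)(v < 0 ? -v : v); /* note: 0x8000 maps back to 0x8000 */
    }
}

uint16_t abs16_table(uint16_t ax) {
    return table[ax];              /* the mov bx, [table + ebx*2] lookup */
}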
We can use x86 partial register shenanigans to isolate the sign bit as a 0 / 1 integer:
xor dx, dx ; DX = 0
mov dl, ah ; DX = AX>>8 (zero extended)
add dx, dx ; DX <<= 1 shifts the sign bit alone into DH
mov dl, dh
mov dh, 0 ; DX = (AX<0) = sign bit of AX zero extended to 16-bit
neg dx ; DX = 0 or -1
Or the last 3 instructions can be optimized down to 2:
neg dh ; 0 or -1 according to sign bit of AX
mov dl, dh ; duplicate to the full DX = 0 or -1
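In C terms, that sequence computes something like this (my sketch, with a made-up sign_mask name; it models the first, 6-instruction version):

#include <stdint.h>

uint16_t sign_mask(uint16_t ax) {
    uint16_t dx = ax >> 8;    /* mov dl, ah: high byte, zero-extended */
    dx += dx;                 /* add dx, dx: sign bit moves up into bit 8 (DH) */
    dx >>= 8;                 /* mov dl, dh / mov dh, 0: DX = 0 or 1 */
    return (uint16_t)-dx;     /* neg dx: DX = 0 or 0xFFFF */
}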
Jackpot; we have our sar ax,15 or cwd value that has all bits 0 or all bits 1, broadcasting the sign bit of AX, ready to use with a 2's complement identity (How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results?) like compilers use (https://godbolt.org/z/n3yoUp).
Typically you'd use xor ax, dx / sub ax, dx to modify the original value.
I had previously thought the challenge required you to avoid modifying any other registers, otherwise the restriction on leaving AX unmodified is trivial and not worth making part of the challenge. But I don't think it's possible without extra scratch space in memory or another register. The edit clarified that there's no need for that.
mov bx, ax
xor bx, dx ; ~x or x
sub bx, dx ; ~x + 1 or x
XOR with -1 flips all the bits, like NOT. XOR with 0 is a no-op. SUB with -1 increments by 1; SUB with 0 is a no-op. (0 is the identity element for addition and XOR.) So this conditionally applies the 2's complement identity -x = ~x + 1.
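Putting it together in C, the conditional negation looks like this (my sketch; it assumes an arithmetic right shift on signed int, which every mainstream compiler provides):

#include <stdint.h>

uint16_t abs16(int16_t x) {
    uint16_t mask = (uint16_t)(x >> 15);  /* 0 or 0xFFFF: the DX value from above */
    return ((uint16_t)x ^ mask) - mask;   /* x unchanged, or ~x + 1 = -x */
}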
PS: It took me several minutes of head-scratching to think of this, ruling out any full-register approaches, and I'm very familiar with x86 and pretty familiar with bit-manipulation, e.g. writing codegolf.SE answers in x86 machine code and doing non-trivial stuff with SIMD. IMO this is a fun hard challenge.
Also, you'd never want to write code like this in real life; cwd or cdq is much more efficient, or for source regs other than AX, copy and sar. The partial-register stuff will even cause stalls on some out-of-order execution CPUs like Intel PPro through Nehalem.
For example, on the Godbolt compiler explorer for this source:
unsigned absval(int x) {
return x<0 ? 0U - x : x;
}
Using an unsigned return value allows us to avoid signed-integer overflow undefined behaviour for the most-negative 2's complement integer (-INT_MIN is undefined behaviour). I think the way I wrote it actually relies on the C implementation being 2's complement, though, because 0U - x converts x to unsigned to match the other side before using it as an operand to binary -. Or maybe that's what we want, for unsigned 0U - x to produce 0x8000 from an input of 0x8000 (for 16-bit int).
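A quick check of that corner case (my own test, not part of the original answer), using 32-bit int as in the Godbolt examples below:

#include <limits.h>
#include <stdio.h>

int main(void) {
    int x = INT_MIN;      /* 0x80000000 as a 32-bit 2's complement int */
    unsigned u = 0U - x;  /* x converts to unsigned; the wraparound is well-defined */
    printf("%x\n", u);    /* prints 80000000 */
    return 0;
}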
GCC does this to set EAX = abs(EDI) (x86-64 System V calling convention):
mov eax, edi
cdq ; sign-extend EAX into EDX:EAX
xor eax, edx
sub eax, edx
ret
clang does this for x86-64, using a conditional move that reads flags from NEG:
mov eax, edi
neg eax ; 0 - x
cmovl eax, edi ; copy the original if 0 was < x
ret
It would have been more efficient on some CPUs to do:
; shorter critical path on CPUs where mov is not zero latency
xor eax, eax
sub eax, edi ; 0 - x
cmovl eax, edi ; copy the original if 0 was < x
ret
Sandybridge eliminates xor-zeroing but not mov, and for CPUs that don't do mov elimination this shortens the critical path: mov eax, edi is on the critical path, but xor-zeroing isn't. Or we could have done mov eax, edi / neg edi / cmovnl eax, edi to again allow the MOV and NEG to run in parallel.
CMOV is a 2-uop instruction on Intel CPUs before Broadwell. (CMOVA and CMOVBE are still 2 uops on current Intel because they read both CF and ZF, which are renamed in separate groups; the other conditions are 1 uop.)