Silly idea #1: Lookup table. That can't work in 16-bit real mode. Even a whole 64kiB segment for a table isn't enough; we need twice that to be able to look up a 2-byte result for any possible 16-bit value.
We could do this with 32-bit addressing easily, like xor ebx, ebx / mov bx, ax / mov bx, [table + ebx*2], if you can justify 128kiB of table data. :P
Fully within the rules, you could construct the table on the stack in 32-bit or 64-bit mode with sub esp, 1<<17 and store the data with mov word [esp+0], 0 / mov word [esp + 2], 1 / etc. Fully unrolled, no looping, so (at roughly 10 bytes per mov word [esp + disp32], imm16 store) several hundred kiB of machine code. But again, this doesn't work in real mode and is a total joke for efficiency.
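For reference, here's what the table approach looks like in C (a minimal sketch; the build_table / abs16_table names are mine, and it builds the table with a loop instead of unrolled stores):

#include <stdint.h>

static uint16_t table[1u << 16];   /* 65536 entries * 2 bytes = 128kiB */

void build_table(void) {
    for (uint32_t i = 0; i < (1u << 16); i++) {
        int16_t v = (int16_t)i;                /* reinterpret the bit pattern */
        table[i] = (uint16_t)(v < 0 ? -v : v); /* note: 0x8000 maps back to 0x8000 */
    }
}

uint16_t abs16_table(uint16_t ax) {
    return table[ax];              /* the mov bx, [table + ebx*2] lookup */
}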
We can use x86 partial register shenanigans to isolate the sign bit as a 0 / 1 integer:
xor dx, dx ; DX = 0
mov dl, ah ; DX = AX>>8 (zero extended)
add dx, dx ; DX <<= 1 shifts the sign bit alone into DH
mov dl, dh
mov dh, 0 ; DX = (AX<0) = sign bit of AX zero extended to 16-bit
neg dx ; DX = 0 or -1
Or the last 3 instructions can be optimized down to 2:
neg dh ; 0 or -1 according to sign bit of AX
mov dl, dh ; duplicate to the full DX = 0 or -1
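In C terms, that sequence computes something like this (my sketch, with a made-up sign_mask name; it models the first, 6-instruction version):

#include <stdint.h>

uint16_t sign_mask(uint16_t ax) {
    uint16_t dx = ax >> 8;    /* mov dl, ah: high byte, zero-extended */
    dx += dx;                 /* add dx, dx: sign bit moves up into bit 8 (DH) */
    dx >>= 8;                 /* mov dl, dh / mov dh, 0: DX = 0 or 1 */
    return (uint16_t)-dx;     /* neg dx: DX = 0 or 0xFFFF */
}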
Jackpot; we have our sar ax,15 or cwd value that has all bits 0 or all bits 1, broadcasting the sign bit of AX, ready to use with a 2's complement identity (How to prove that the C statement -x, ~x+1, and ~(x-1) yield the same results?) like compilers use (https://godbolt.org/z/n3yoUp).
Typically you'd use xor ax, dx / sub ax, dx to modify the original value.
I had previously thought the challenge required you to avoid modifying any other registers, otherwise the restriction on leaving AX unmodified is trivial and not worth making part of the challenge. But I don't think it's possible without extra scratch space in memory or another register. The edit clarified that there's no need for that.
mov bx, ax
xor bx, dx ; ~x or x
sub bx, dx ; ~x + 1 or x
XOR with -1 flips all the bits, like NOT. XOR with 0 is a no-op. SUB with -1 increments by 1; SUB with 0 is a no-op. (0 is the identity element for addition and XOR.) So this conditionally applies the 2's complement identity -x = ~x + 1.
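Putting it together in C, the conditional negation looks like this (my sketch; it assumes an arithmetic right shift on signed int, which every mainstream compiler provides):

#include <stdint.h>

uint16_t abs16(int16_t x) {
    uint16_t mask = (uint16_t)(x >> 15);  /* 0 or 0xFFFF: the DX value from above */
    return ((uint16_t)x ^ mask) - mask;   /* x unchanged, or ~x + 1 = -x */
}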
PS: It took me several minutes of head-scratching to think of this, ruling out any full-register approaches, and I'm very familiar with x86 and pretty familiar with bit-manipulation, e.g. writing codegolf.SE answers in x86 machine code and doing non-trivial stuff with SIMD. IMO this is a fun hard challenge.
Also, you'd never want to write code like this in real life; cwd or cdq is much more efficient, or for source regs other than AX, copy and sar. The partial-register stuff will even cause stalls on some out-of-order execution CPUs like Intel PPro through Nehalem.
For example, on the Godbolt compiler explorer for this source:
unsigned absval(int x) {
return x<0 ? 0U - x : x;
}
Using an unsigned return value allows us to avoid signed-integer overflow undefined behaviour for the most-negative 2's complement integer (-INT_MIN is undefined behaviour). I think the way I wrote it actually relies on the C implementation being 2's complement, though, because 0U - x converts x to unsigned to match the other side before using it as an operand to binary -. Or maybe that's what we want, for unsigned 0U - x to produce 0x8000 from an input of 0x8000 (for 16-bit int).
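A quick check of that corner case (my own test, not part of the original answer), using 32-bit int as in the Godbolt examples below:

#include <limits.h>
#include <stdio.h>

int main(void) {
    int x = INT_MIN;      /* 0x80000000 as a 32-bit 2's complement int */
    unsigned u = 0U - x;  /* x converts to unsigned; the wraparound is well-defined */
    printf("%x\n", u);    /* prints 80000000 */
    return 0;
}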
GCC does this to set EAX = abs(EDI) (x86-64 System V calling convention):
mov eax, edi
cdq ; sign-extend EAX into EDX:EAX
xor eax, edx
sub eax, edx
ret
clang does this for x86-64, using a conditional move that reads flags from NEG:
mov eax, edi
neg eax ; 0 - x
cmovl eax, edi ; copy the original if 0 was < x
ret
It would have been more efficient on some CPUs to do:
; shorter critical path on CPUs where mov is not zero latency
xor eax, eax
sub eax, edi ; 0 - x
cmovl eax, edi ; copy the original if 0 was < x
ret
Sandybridge eliminates xor-zeroing but not mov, and for CPUs that don't do mov elimination this shortens the critical path: mov eax, edi is on the critical path, but xor-zeroing isn't. Or we could have done mov eax, edi / neg edi / cmovnl eax, edi to again allow the MOV and NEG to run in parallel.
CMOV is a 2-uop instruction on Intel CPUs before Broadwell. (CMOVA and CMOVBE are still 2 uops on current Intel because they read both CF and ZF, which are renamed in separate groups; the other conditions are 1 uop.)