Use a temporary register (other than EFLAGS) to make this lower latency on CPUs without single-cycle `adc`:
```
mov   ecx, eax
bswap eax                  ; bit 0 -> bit 24, bit 31 -> bit 7
shl   eax, 7               ; top bit in place
shr   ax, 7+7              ; bottom bit in place (without disturbing top bit)
and   ecx, 0x7ffffffe      ; could optimize mov+and with BMI1 andn
and   eax, 0x80000001
or    eax, ecx             ; merge the non-moving bits with the swapped bits
```
On Intel CPUs before Sandybridge, `shr ax` and then reading EAX will suck (partial-register stall).

This looks like 5 cycle latency from input to output, same as the `adc`/`adc` version of @Fuz's on CPUs where that's single-cycle latency (AMD, and Intel since Broadwell). But on Haswell and earlier, this may be better.
We could save the `mov` using either BMI1 `andn` with a constant in a register, or maybe BMI2 `rorx ecx, eax, 16` to copy-and-swap instead of doing `bswap` in place. But then the bits are in less convenient places.
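For instance, a minimal sketch of the `andn` variant (the mask register, EDX here, is my arbitrary choice; the mov-immediate could be hoisted out of a loop):

```
mov   edx, 0x80000001      ; mask of the two bits being swapped (hoistable)

andn  ecx, edx, eax        ; ecx = eax & ~0x80000001, replacing mov+and
bswap eax
shl   eax, 7               ; top bit in place
shr   ax, 7+7              ; bottom bit in place
and   eax, 0x80000001
or    eax, ecx             ; merge the non-moving bits with the swapped bits
```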
@rkhb's idea to check if the bits differ and flip them is good, especially using PF to check for 0 or 2 bits set vs. 1. PF is only set based on the low byte of a result, so we can't just `and 0x80000001` without rotating first.
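A branchy sketch of that idea (my own illustration using `rorx`; not necessarily @rkhb's exact code):

```
rorx  ecx, eax, 31        ; rotate left by 1: low 2 bits are the two bits we care about
test  cl, 3               ; PF=1 iff they're equal (0 or 2 bits set)
jpe   .no_swap            ; equal: nothing to do
xor   eax, 0x80000001     ; differ: flip both bits to swap them
.no_swap:
```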
You can do this branchlessly with `cmov`:

```
; untested, but I think I have the parity correct
rorx   ecx, eax, 31      ; ecx = rotate left by 1.  low 2 bits are the ones we want
xor    edx, edx
test   cl, 3             ; sets PF=1 iff they're the same: even parity
mov    ecx, 0x80000001
cmovpo edx, ecx          ; edx=0 if bits match, 0x80000001 if they need swapping
xor    eax, edx
```
With single-uop `cmov` (Broadwell and later, or AMD), this is 4 cycle latency. The xor-zeroing and mov-immediate are off the critical path. The mov-immediate can be hoisted out of a loop if you use a register other than ECX.
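For example, with the constant hoisted (ESI is my arbitrary choice of spare register), only the `rorx`/`test`/`cmovpo`/`xor` chain stays in the loop body:

```
mov    esi, 0x80000001   ; loop-invariant mask, set up once before the loop

; inside the loop, per value in EAX:
rorx   ecx, eax, 31
xor    edx, edx
test   cl, 3
cmovpo edx, esi          ; no mov-immediate inside the loop
xor    eax, edx
```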
Or with `setcc`, but it's worse (more uops), or tied on CPUs with 2-uop `cmov`:
```
; untested, I might have the parity reversed.
rorx  ecx, eax, 31       ; ecx = rotate left by 1.  low 2 bits are the ones we want
xor   edx, edx
and   ecx, 3             ; sets PF=1 iff they're the same: even parity
setpe dl                 ; dl=1 if the two bits are equal
dec   edx                ; 0 or -1
and   edx, 0x80000001    ; 0, or the mask of the two bits
xor   eax, edx
```