4

I have a byte-sized number in register AL (8 bits), in 32-bit x86.

I need to exchange between bit 1 (the second from right) and bit 4 (the fifth from right) of the number in register AL.

For example

00010000B    input
00000010B    output
Peter Cordes
Gilad

6 Answers

8

One-instruction answer (needs a 256-byte LUT).

xlatb  ; needs ds:(e)bx to point to the LUT

One- to two-instruction answer.
(If you need to zero-extend al, prefer movzx to a different register if you can trash a temp reg.)

movzx eax, al                  ; mask off extra bits -- perhaps not needed,
mov   al, look_up_table[eax]   ;  if upper bits are guaranteed to be zero
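
One possible way to generate the 256-byte table at assembly time (a sketch in NASM syntax; the original answer only references look_up_table and doesn't show the table, and the constants 0EDh/02h/10h follow from swapping bits 1 and 4):

look_up_table:                  ; table[i] = i with bits 1 and 4 exchanged
%assign i 0
%rep 256
    db (i & 0EDh) | ((i & 02h) << 3) | ((i & 10h) >> 3)
%assign i i+1
%endrep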

Three instructions using al alone.

  test al, 0x12
  jpe skip       ;; parity was even (aka the bits are the same)
  xor al,  0x12  ;; toggle both bits
skip:

Theory of operation: the bits need to be exchanged only when they differ. A 0 changes to 1 and a 1 changes to 0 by XORing it with 1, and both bits are toggled at the same time.
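
As a sanity check, here is the question's example input (00010000B) traced through that sequence:

  test al, 0x12  ;; 00010000b AND 00010010b = 00010000b -> one bit set, parity odd
  jpe skip       ;; not taken, so:
  xor al,  0x12  ;; 00010000b XOR 00010010b = 00000010b -> the expected output
skip: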

The jump could be avoided with a cmovpo or cmovpe conditional move if you can assume a P6-compatible CPU, i.e. any modern x86. But then the sequence requires 4 instructions (or maybe only 3 if some register is known to contain a zero or the bit mask). cmov isn't available with 8-bit operand-size, so we use movzx to avoid partial-register problems and enable mov-elimination on Intel CPUs.

 movzx   edx, al
 xor     edx, 0x12    ; toggle the copy
 test    al,  0x12    ; test the original
 cmovpo  eax, edx     ; use the flipped version if exactly 1 of 2 bits were set

With other masks, if either bit is outside the low 8, PF doesn't work

PF is only set according to the low 8 bits of the result, so unless you want to horizontally XOR a wider integer down to 8 bits starting with something like mov edx, eax / shr edx, 16 / xor edx, eax, you can't use PF this way.
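
A minimal sketch of that folding, assuming a hypothetical 32-bit mask MASK32 whose two bits are not both in the low byte (MASK32 and the bits_differ label are placeholders, not part of the original answer):

 mov   edx, eax
 and   edx, MASK32    ; keep only the two bits of interest
 mov   ecx, edx
 shr   ecx, 16
 xor   edx, ecx       ; fold bits 31:16 into 15:0
 mov   ecx, edx
 shr   ecx, 8
 xor   dl, cl         ; fold bits 15:8 into 7:0; PF = parity of all 32 masked bits
 jpo   bits_differ    ; or cmovpo, as in the version above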

Alternatively one could test whether a & mask has only one bit set. With t = a & mask, that is the expression t == (t & -t) (like BMI1 blsi), or (t & (t-1)) == 0 (like BMI1 blsr). But both of those are also true for t == 0, not only for exactly 1 bit set.

If you actually have BMI1 blsr, it sets CF if the input was zero, letting you distinguish zero bits set from 1 bit set.
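
A sketch of how those flags encode the distinction (assumes BMI1, and that edx already holds the 2-bit mask):

 and   edx, eax      ; edx = a & mask
 blsr  ecx, edx      ; clear lowest set bit: ZF=1 if 0 or 1 bits were set, CF=1 if none were
 ; exactly one of the two bits set  <=>  ZF=1 and CF=0, i.e. the bits differ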


With BMI2/PEXT

pext rcx, rax, rbx   ; rax = input, rbx = mask with 2 bits set
xor  rbx, rax        ; toggled version (destroys mask)
test ecx, ecx        ; compute parity flag from the 8 bottom bits
cmovpo rax, rbx      ; use toggled version if only 1 bit was set

Here pext extracts (compresses) the two bits selected by the mask in rbx from the source rax into the two least significant bits of the destination register rcx. Since the bits to be tested are now in the low byte, they can be tested to set/clear PF.
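
For instance, with the question's example input 00010000B and a mask of 0x12, the sequence goes like this (a trace of the code above, not extra instructions):

pext rcx, rax, rbx   ; rbx = 00010010b, rax = 00010000b  ->  rcx = 10b
xor  rbx, rax        ; rbx = 00000010b, the toggled copy
test ecx, ecx        ; one bit set in the low byte -> PF = odd
cmovpo rax, rbx      ; taken, so rax = 00000010b: the two bits have been exchanged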

Aki Suihkonen
  • Wow! This is the first time I've seen the parity bit used, after almost 20 years of asm coding. – Nils Pipenbrinck Nov 03 '12 at 11:37
  • I edited to fix a mix of Intel and AT&T syntax, now using pure Intel syntax. I ended up noticing more things to improve, like `movzx` instead of `and eax, 255`. Also you suggested that `cmov` would need *at least* 4 instructions, but if you have a scratch register you can use, it's doable with 4 instructions without any other pre-conditions. (Added a code block). Also, your `single_bit_set( a & mask )` idea was interesting, but you need to exclude `a & mask == 0` which I don't think you were doing. That would be the way to go if either of your two bits to swap were outside the low 8. – Peter Cordes May 29 '22 at 17:53
  • Thanks, fixed. Actually I think even `test ecx,ecx` would work equally well, because the top 30 bits are cleared. – Aki Suihkonen May 30 '22 at 04:57
  • @AkiSuihkonen: Oh yes, that is even better, only 2 bytes vs. 3 for `test cl, imm8` vs. 7 for `test rcx, 3` which NASM doesn't optimize. `test cl,cl` would be equally good, but has zero advantages (except reminding the reader that PF is only set from the low 8 anyway). I don't suggest switching. – Peter Cordes May 30 '22 at 05:03
  • `mov edx,eax` might be preferable over `movzx edx, al` to preserve the top bits of eax in both cases. – Aki Suihkonen May 30 '22 at 06:00
5

You may try this:

mov BL, AL
mov BH, AL

and BL, 02h   ; clear all bits except bit 1
and BH, 10h   ; clear all bits except bit 4

shl BL, 3     ; move bit 1 to the position of bit 4
shr BH, 3     ; move bit 4 to the position of bit 1

and AL, 0EDh  ; clear bits 1 and 4
or  AL, BL    ; set bit 4 (from the original bit 1)
or  AL, BH    ; set bit 1 (from the original bit 4)

The AL register contains the result. You may also need to preserve data already stored in register BX; if you do, prepend the solution with

push BX

and append to the end

pop BX
FUT
  • Thank you very much! now I'll try to figure out how to raise your reputation – Gilad Nov 03 '12 at 09:08
  • @user1461085: You can give him 15 points by clicking the check mark next to this answer, if it solved your problem. :) That also lets people know which answer helped you the most, so it's encouraged anyway. – cHao Nov 03 '12 at 09:40
  • I tried to place here the easiest solution, but Aki's solution is the slimmest! The choice is up to you. P.S. It is the empty tick to the left of the answer ;). – FUT Nov 03 '12 at 17:24
1

Another alternative (7 instructions), where A is the original bit 4, B is the original bit 1, and a,b,c,e,f,g are the other bits:

mov ebx,00010010b         ;ebx = 00010010
and ebx,eax               ;ebx = 000A00B0
lea ebx,[ebx+ebx*4]       ;ebx = 0A0AB0B0
ror bl,5                  ;ebx = AB0B00A0
and ebx,00010010b         ;ebx = 000B00A0
and al,11101101b          ;al = abc0ef0g
or  al,bl                 ;al = abcBefAg

And another alternative (also 7 instructions), where ? marks don't-care bits in AH:

ror al,1           ;ax = ????????.gabcAefB
ror ax,1           ;ax = B???????.?gabcAef
ror al,3           ;ax = B???????.Aef?gabc
rol ax,1           ;ax = ???????A.ef?gabcB
rol al,3           ;ax = ???????A.gabcBef?
ror ax,1           ;ax = ????????.AgabcBef
rol al,2           ;ax = ????????.abcBefAg
Brendan
  • 35,656
  • 2
  • 39
  • 66
1

I add my six instruction alternative:

xor   bl, bl  ; -> clear work register
btr   ax, 1   ; -> second bit to carry, clear that bit in al
cmovc bl, 4   ; -> set bit at bitpos 2 (will end up at bitpos 4)
btr   ax, 4   ; -> fifth bit to carry, clear that bit in al
rcl   bl, 2   ; -> carry becomes bitpos 1, bitpos 2 shifts to bitpos 4
or    al, bl  ; -> merge bits

Note: This is just an academic exercise. You probably don't want code that uses btr instructions because these are slow, at least they were the last time I tried to use them. Also: untested.

Needs 486 instruction set.

Nils Pipenbrinck
  • The CMOVcc has to be replaced with a valid addressing mode (e.g. a 16- or 32-bit reg/reg move, or a read from memory). For some reason Intel didn't include an immediate move. – Aki Suihkonen Nov 03 '12 at 14:04
  • `cmov` requires P6 (Pentium Pro) or x86-64. `btr` requires 386. Rotate-through-carry with a count other than 1 is quite slow; `btr reg, imm` is fast, at worst 2 uops on AMD, 1 on Intel P6/SnB-family. (Only slow on P5 Pentium, 7 cycles there). https://uops.info/ / https://agner.org/optimize/ – Peter Cordes May 29 '22 at 17:00
1

6 instructions using al alone

           ; al       | cf
           ; 76543210 | x
ror al, 5  ; 43210765 | x
bt  al, 4  ; 43210765 | 1
rcl al, 4  ; 07651432 | 1
bt  al, 2  ; 07651432 | 4
rcr al, 3  ; 32407651 | 4
rol al, 4  ; 76513240 | x
panda-34
  • [`bt` doesn't have an 8-bit form](https://www.felixcloutier.com/x86/bt). You'll need `bt eax, 4`. (Which would incur a partial-register stall on P6-family CPUs like Nehalem). – Peter Cordes May 29 '22 at 17:02
1

Aki's answer using PF for test/cmovpo (odd parity) is probably best when the bits are in the low byte. Without that it's trickier. The general strategy of checking whether the two bits are the same or not is probably good, though, since x86 doesn't have good bitfield extract/insert instructions.

The general case: 2 bits not both in the low 8

We can shift a copy so the bits line up, then XOR and isolate the bit where they differ: (a ^ (a>>distance)) & low_bit, where low_bit is the lower of the two bits in the mask. That requires you to know the distance between the set bits in the mask, which is fine for known-constant masks but not for runtime-variable masks.

 mov     ecx, eax       ; alternative: rorx ecx, eax, 13  BMI2 saves a mov
 shr     ecx, 13        ; tzcnt(0x10000) - tzcnt(0x00008)
 xor     ecx, eax       ; diff the 2 bits we care about, garbage everywhere else

 mov     edx, eax
 xor     edx, 0x10008   ; toggled copy

 test    cl,  0x8       ; (or test ECX if the low bit isn't in the low 8)
 cmovne  eax, edx       ; take the flipped version if the bits weren't equal

This is 7 instructions, or 6 with BMI2 rorx to copy-and-rotate. But it has good instruction-level parallelism and no partial-register stalls or false dependencies. It also avoids rcl/rcr, which are slow, and very slow with a count other than 1.

ror with an immediate count is single-uop on modern Intel and AMD, but a count of 1 costs 2 uops on Intel because of its weird partial-FLAGS semantics updating OF and leaving others unmodified. See https://uops.info/ and https://agner.org/optimize/


Another possible approach is checking for a single bit being set in a & mask. popcnt(a&mask) == 1 can work if you have it, otherwise tmp ^= (tmp >> 16); and so on can set up for using PF. Although at that point it's just a variation on the version above, costing more instructions to mask before copying and shifting.
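
For illustration, a sketch of the popcnt variant with the same 0x10008 mask as above (it assumes popcnt is available, and it doesn't save anything over the shr/xor version, also being 7 instructions):

 mov     ecx, eax
 and     ecx, 0x10008
 mov     edx, eax
 xor     edx, 0x10008   ; toggled copy
 popcnt  ecx, ecx       ; 1 iff exactly one of the two bits was set
 cmp     ecx, 1
 cmove   eax, edx       ; take the toggled copy when the bits differ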

BMI1 blsr / blsi have possibilities: they set CF depending on whether the source was zero, and ZF as usual according to the result being zero. So they can distinguish a & mask == 0 from a & mask having exactly 1 bit set, which also makes the result zero after tmp & (tmp-1) (what blsr computes).

Unfortunately the straightforward way of using blsr would need a CMOV that copies on ZF=1 && CF=0. There isn't a condition for that; cmova checks CF=0 and ZF=0. Also, cmova is 2 uops on recent Intel CPUs, since it has 4 inputs: it reads CF plus the rest of FLAGS (the separately-renamed SPAZO group), which is how Intel since Broadwell handles instructions that need both parts instead of using partial-flag merge uops.

With a run-time variable mask: 0x10008 could be another register

 mov     ecx, eax
 and     ecx, 0x10008

 mov     edx, eax
 xor     edx, 0x10008   ; toggled copy

 blsr    ecx, ecx       ; ZF=1 if the result is zero (0 or 1 bit set before), CF=1 if input was 0
 cmovc   edx, eax       ; if CF=1, we always want the original value.  Copy it so the next cmov has both operands the same.
 cmovz   eax, edx       ; copy if ZF=1

; There is no cmov?? that copies if ZF=1 && CF=0.

So EAX gets the toggled copy from EDX only if CF=0 (so we don't overwrite EDX) and ZF=1 (so we then copy to EAX).

This feels quite inefficient, but is the best I've thought of for runtime-variable 2-bit-set masks.

That brings this up to 7 uops, break-even with the shr/test way. (And on Intel Haswell and earlier, each cmov is 2 uops so it's worse). The only upside here is working for a runtime-variable mask that we have in another register.

Any other single instruction would still be break-even at best, but I'm not sure what else could even work. E.g. a dec ecx to affect ZF without touching CF? But not in a useful way here. cmc to flip CF wouldn't help: then we'd need to cmov on ZF=1 and CF=1, but cmovbe (also 2 uops) checks CF=1 or ZF=1.

Peter Cordes
  • In the 2nd version, a `jc` over the `cmovz` might be good. (Or do it the other way `jz` over a `cmovc`, if that's the more predictable branch: 2 vs. (0 or 1) bits of the pair set, instead of 0 vs. (1 or 2) with `jc`) – Peter Cordes May 30 '22 at 01:23