1

I've written this algorithm where three condition bits are evaluated to determine how text should be aligned based on a given address.

7 = Justify right
6 = Justify center
5 = Justify left.

Currently, there isn't a default implemented so if nothing is defined, ZF is returned indicating nothing happened. Text not being on the screen where expected will be an indicator too.

        test    al, 111000000B
        jz      Done

Now this solution works for me because as part of the ABI I'm designing, where CF denotes error. This addresses the need and will work for 16/32/64 bit, but I want to think there is a more efficient way or at least one that isn't so weighty.

    0  0FBC46FE          bsf ax,[bp-0x2]
    4  88C3              mov bl,al
    6  0FBD46FE          bsr ax,[bp-0x2]
    A  28C3              sub bl,al
    C = 12 bytes

As there is special processing for each condition at epilogue, I could catch it there by rotating bit AH into carry and keep doing that until CY and ZF. If there is a carry and non-zero then I know there was more than 1 bit set or I could just ignore extraneous bits and then function would be prioritized by bit position.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Shift_Left
  • 1,208
  • 8
  • 17
  • If they're mutually exclusive, why do you want 3 separate bits, instead of encoding a 0..2 or 1..3 integer into 2 bits? You can check that at least one of the highest 2 bits is set with `cmp al, 00111111b` / `jna Done`. As a bonus, that sets CF in the no-bits-set case. And you can use the 1..3 value as part of a computed jump if you want to dispatch that way instead of with cmp + ja after verifying the input. – Peter Cordes Feb 16 '18 at 05:00
  • If you're optimizing for code-size, debug the code first (with optional checks in place, doesn't matter if they're efficient), then build it without the checks. If it still works, then you have an efficient binary that doesn't waste code bytes verifying things that are supposed to be invariants. especially if the failure mode is as innocent as letting one bit "win" when multiple are set, not crashing. – Peter Cordes Feb 16 '18 at 05:06
  • 1
    What set of ISA extensions are you targeting? does `and ax, 11100000b` / `popcnt bx, ax` / `cmp bl, 1` or something work for you? (`popcnt` normally goes with SSE4.2 / Nehalem). Any VEX-coded BMI / BMI2 instructions won't work in 16-bit mode, but it might still be good to use the **bithack [test for an integer being a power of 2](https://stackoverflow.com/questions/108318/whats-the-simplest-way-to-test-whether-a-number-is-a-power-of-2-in-c): `(n & (n - 1)) == 0`** (actually test for having *at most* 1 bit set, not *exactly 1* bit set.) – Peter Cordes Feb 16 '18 at 05:14
  • @PeterCordes the mutual exclusivity aspect probably is a habit from AVR, so at this point I never have a definitive idea what I want to do, but that does lend itself well to bit shifting that I mentioned in the last paragraph. I think what will be most efficient here is moving bit 7-5 into 1-0 (range 1 to 3) where it's much more conducive to computed jumps. So far, only one other bit is being used and it is a toggle that can be masked out leaving me with a unambiguous value that can even be shift for vectors into word or dword arrays in real mode. – Shift_Left Feb 16 '18 at 05:59
  • @PeterCordes This real-mode incarnation is far more comprehensive than it needs to be, but is the foundation for it's protected and long-mode equivalents, so the hardware I have now is SSE4 and AESNI capable. Thanks for the link into **bithack**, but I better work more to getting out of real-mode. – Shift_Left Feb 16 '18 at 06:12
  • Why do you say "but"? You can implement that bithack with maybe 3 instructions. Or did you mean you're going to abandon the 16-bit implementation of this entirely? – Peter Cordes Feb 16 '18 at 06:21
  • 1
    @PeterCordes Boot and initialization to the extent of verbosity and user interactivity is still on track, but this routine that is essentially a really souped up **INT 10H AH=13H** is growing far beyond what it needs to be, but because protected and long mode implementations are going to be very similar not a complete waste. So to that extent, yes I'm going to abandon this after two more functions and move on toward the ultimate objective. – Shift_Left Feb 16 '18 at 09:45

1 Answers1

3

You might want to redesign this to use 3 or 4 values in a 2-bit field, instead of a bitmap, if they're all mutually exclusive anyway. (See my comments on the question).

After masking off the non-bitmap bits:

One pretty good way is to use the (n & (n - 1)) == 0 bithack to an integer for being a power of 2. (Or actually, to test that it has at most 1 bit set, not exactly 1 bit set, because that expression is true when n == 0).

You can implement that with lea edx, [rax-1] / and edx, eax / add edx, -1, which leaves CF set if eax had more than 1 bit set, otherwise CF is clear. (add edx, -1 carries for every edx value except 0).


On CPUs with BMI1

VEX prefixes are not usable in real-mode or virtual-8086 mode (for compat with some old real-mode conventions of using the same previously-illegal bit pattern as a trap). But they are usable in 16-bit protected mode I think, and definitely in 32 / 64-bit mode. This is relevant because blsr is encoded with a VEX prefix.

There's an extremely efficient way to implement that bithack with blsr (Reset Lowest Set Bit), because it sets flags in a way that's useful for this. Almost as if instruction-set architects knew what they were doing...:

and   eax, 11100000b     ; clear other bits

; requires BMI1
blsr  edx, eax           ; destination = a dummy register, also sets flags

jnz  multiple_bits_were_set       ; and thus edx still has a bit set
jc   input_was_zero

i.e. it does everything you want in one instruction, even using CF in a way that works for your code without a cmc. Your multiple_bits_were_set branch target could just stc and fall into input_was_zero.

There's no JCC that jumps if (CF==1 or ZF==0), only ja (CF=0 and ZF=0) and jbe (CF=1 or ZF=1). I don't think even cmc can help; we'd really need to complement ZF, not CF. (Skylake and later CPUs don't have flag-merging uops or partial flag stalls, and at worst on Haswell it would be a merging uop not a stall, if you were doing something where cmc was useful. Perhaps with a different use-case, since it doesn't seem useful with blsr.)

You might want 2 branches. Or since you're probably branching on which bit was set, maybe only use blsr to check for multiple bits set, and then use cmp al, 01000000b / jae bits_1_or_2 to sort things out into highest or 2nd highest vs. lowest (of the top 3) or none set.


If you want to only branch once for no bits or too many bits:

 AND   eax, mask

 blsr  edx, eax     ; edx = 0 iff no more than 1 bit was set in the input
 ; CF=1 iff eax==0
 adc   edx, 0
 jnz   not_exactly_one_bit_set

The adc can't wrap to zero because blsr always leaves the low bit clear, by definition. edx can only be zero (and thus ZF can only be set) if edx=0 after BLSR, and the adc didn't add anything.

If your bitmask doesn't include the high bit of the register, adc edx,edx would also work, saving one byte. But adc with an immediate 0 may is special-cased to 1 uop even on Haswell, where the general case is 2 uops there. (Later Intel CPUs always have 1 uop adc, earlier Intel CPUs don't have BMI1. AMD CPUs always have 1 uop adc.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Note that `blsr` / `cmc` / `jbe exactly_one_bit_set` would not work — it would always jump. To use `jbe`, you would need to complement ZF instead of CF. On the other hand, `blsr` / `adc` / `jnz` works great. BTW, there is a typo in the comment, which should read "CF=1 iff eax=0". – Antosha Feb 03 '21 at 15:04
  • 1
    @Antosha: Thanks, fixed both :) – Peter Cordes Feb 03 '21 at 20:39