Sep's answer explains several problems in the attempt in the question.
I thought it might be fun to show a variation on the simple and efficient way of getting input 1 digit at a time, using the total = total*base + digit method of accumulating a result starting with the most-significant digit, to see how that compares for simplicity and efficiency.
Also, I had a look at doing 2 or 4 bits at a time with shifts or rotates, or a multiply bithack, or even 16 bits at a time with SSE2. See later sections of the answer for that, and for discussion of variations and micro-optimizations like ending with a not so we can build the bit-string inverted with adc inside the loop.
This is based on Sep's "bonus" version, with the loop body rewritten. I simplified the loop condition to exit on any invalid character, including newline, instead of ignoring anything except newline and continuing to loop. I also removed the loop counter; the user can enter more digits if they want before pressing enter, and we truncate to keep the last 4.
.model small
.stack
.data
msg db 13, 10, 10, "Enter a base-2 number up to 4 bits long: $"
hex db "0123456789ABCDEF"
.code
START:
mov ax, @DATA
mov ds, ax
mov dx, OFFSET msg
mov ah, 09h
int 21h
xor bx, bx ; Integer value of the binary string input
mov ax, 0100h ; zero AL, AH=01h = DOS char input
More:
; append the previous bit / digit (0 on the first iteration instead of jumping over this)
shl bl, 1 ; bl <<= 1, making room for the new digit
or bl, al ; bl |= digit (AL is 0 or 1 from the previous iteration)
; get a new char and check it for being a base-2 digit. AH still set.
int 21h ; digit -> AL
sub al, '0'
cmp al, 1
jbe More ; loop while the char is '0' or '1'; fall through (exit) on anything else, including newline
Done:
and bx, 0Fh ; truncate aka mod 16 in case the user entered a larger number
mov dl, [hex + bx]
mov ah, 02h
int 21h
mov ax, 4C00h ; Terminate the program
int 21h
end START
Efficiency of the loop is mostly irrelevant since we're calling a slow int 21h I/O function, but it would be interesting if we were reading from an array with a similar loop. And small machine-code size in bytes is always nice, especially for retro-computing or for I-cache footprint.
The loop condition is a range-check for the input being '0'-'1'. So when we loop back to the top, FLAGS are set according to cmp al, 1 where AL was 0 (ZF=0, CF=1) or 1 (ZF=1, CF=0). Input bytes outside that range result in (ZF=0, CF=0), so jbe wasn't taken and we fall out of the loop.
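To spell out the cases of that range check (just restating the behaviour above as a quick reference in comment form):

sub al, '0'  ; ASCII digit -> integer
cmp al, 1
; '0' -> AL=0:  CF=1, ZF=0  -> jbe taken (below)
; '1' -> AL=1:  CF=0, ZF=1  -> jbe taken (equal)
; any other char -> AL >= 2 unsigned:  CF=0, ZF=0  -> jbe not taken, loop exits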
Since the range only contains 2 integers, making the top one special by using cmp al, 1 / jbe instead of cmp al, 2 / jb (aka jc) does leave us with CF set opposite to the bit value we want to append to the total we're accumulating (CF = !AL), as well as excluding all values outside the range.
As Sep points out in comments, cmc / adc bx,bx would be one way to use that CF value instead of the AL value with shl/or. That could save 1 byte of code size (or 2, since we could mov ah, 1 instead of mov ax, 0100h).
But our current code falls into the loop with CF=0 from xor bx,bx, so the total = total*base + digit code would run with digit = !CF = 1. We'd either need to enter the loop with a jmp to the code that gets a new character (skipping the cmc/adc), or we'd need an extra stc before falling into the loop, or we'd have to partially peel the first iteration, reading an input character and branching to either skip or fall into the loop.
In the current loop, where we use AL for the digit instead of CF, running the total = total*base + digit code with digit=0 has no effect. (And we were able to zero AL for 1 extra byte of code size but no extra instructions, as part of setting AH=01h.) For code-golf (minimum code size regardless of efficiency), we might use stc outside the loop and cmc / adc bx,bx inside it, for a net saving of 1 byte of code size, at the cost of one more instruction outside the loop, and of using adc, which is slower on Intel (3 uops: https://agner.org/optimize/) from P6 until Broadwell.
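For example, a sketch of that code-golf variant (untested; same structure as the loop above, with stc so we can fall in cleanly and cmc / adc bx,bx doing the appending):

xor bx, bx ; result = 0
mov ah, 01h ; DOS char input; AL's initial value no longer matters
stc ; CF=1 so the first cmc/adc appends a 0 bit
More:
cmc ; CF = digit (the cmp below left CF = !digit)
adc bx, bx ; bx = bx*2 + digit
int 21h ; new char -> AL
sub al, '0'
cmp al, 1 ; '0': CF=1,ZF=0.  '1': CF=0,ZF=1.  other: CF=0,ZF=0
jbe More ; keep looping while the input is a binary digit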
The bit-value we want to append is also present non-inverted in ZF, but only CF has special instructions to add it (adc/sbb) or shift it (rcl/rcr) into registers. With 386 setz al, we could re-materialize ZF into a register, but that would be silly because we already have the 0-or-1 digit value in AL.
Other fun things you can do with CF include sbb reg, -1 to add !CF (either reg -= -1 + 1 or reg -= -1 + 0), but that's not useful here without first shifting bx, and it's inconvenient to do that without modifying FLAGS. It's also inconvenient to do it before the loop condition which sets FLAGS.
Setting CF according to AL != 0 can be done with add al, 255 (2 bytes), but add al, 255 / adc bx,bx isn't better than a simple shl bl,1 / or bl, al in our case.
We could pull the inversion out of the loop by starting with BX=-1 and doing a not afterwards, before the and bx, 0Fh. This would allow just adc bx,bx at the top of the loop to append CF to BX, building up an inverted bit-pattern which we flip at the end. Starting with BX = all-ones (0xFFFF = 2's complement -1) is the same as starting with BX=0 and inserting non-inverted bits. The first iteration would still need to enter with CF=1 or skip the bx = (bx<<1)+CF part, e.g. by using stc.
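A sketch of that variant (untested), building the inverted bit-string and fixing it up after the loop:

mov bx, -1 ; 0FFFFh: accumulate the inverted bit-string
mov ah, 01h ; DOS char input
stc ; CF=1 so the first adc appends an inverted 0 bit
More:
adc bx, bx ; bx = bx*2 + !digit  (the cmp below leaves CF = !digit)
int 21h ; new char -> AL
sub al, '0'
cmp al, 1
jbe More
Done:
not bx ; flip to get the actual bits
and bx, 0Fh ; truncate to the last 4 digits, as before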
If we had the digit zero-extended into AX (easy if we were looping over an array), we could use 386 instructions like lea bx, [ebx*2 + eax], since 32-bit addressing modes have a 2-bit shift count for the index and aren't as restricted in which registers can be used. We could also make the operand-size 32-bit as well, to handle inputs up to 32 bits long.
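For instance, a hypothetical 386 loop over an array of already-validated '0'/'1' bytes at DS:SI, with the length in CX (the register choices here are just for illustration):

xor ebx, ebx ; 32-bit total
NextDigit:
movzx eax, byte ptr [si] ; zero-extend one ASCII digit into EAX
sub al, '0' ; 0 or 1; upper bytes of EAX stay zero
lea ebx, [ebx*2 + eax] ; total = total*2 + digit, without touching FLAGS
inc si
dec cx
jnz NextDigit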
Performance, and partial registers on semi-modern CPUs like P6
Putting the conditional branch at the bottom of the loop means you only need one branch total, which is generally better both for code size and for performance, although performance is mostly irrelevant for a loop with I/O in it. Ways to handle the first iteration include peeling an initial input and check (doing the first char input before entering the loop), or a jmp to the input+condition at the bottom, or what I'm doing here: arranging the initial state so we can just "fall into" the loop and add a zero.
Inside the loop I could have shifted the whole register with shl bx, 1 instead of just bl, which would allow the same loop to work for inputs of up to 16-bit integers without truncating them. But reading+writing a 16-bit register after writing its low 8 bits with or bl, al would create a partial-register stall on older Intel CPUs. And or bx, ax would also create a partial-register stall reading AX, and more importantly would OR in a non-zero high bit from AH=1.
(We're probably creating a partial-register stall anyway by writing BL, assuming the int 21h handler pushes and pops BX at some point, thus writing the full register. If not for that, the xor bx,bx xor-zeroing ahead of the loop would let P6-family (PPro to Nehalem) CPUs know that BX=BL so they don't stall when reading the full BX later. But any write of the full BX breaks that upper-bits-known-zero internal state. Only P6-family renames BL separately from BX, so other microarchitectures don't have this penalty. First-gen Sandybridge can still rename BL separately from EBX, but not BX from EBX, so xor bx,bx isn't a zeroing idiom there.)
cmc or add al, 255 (to set CF = (AL != 0)) / adc bx,bx would work without partial-register stalls, but adc reg,reg is 2 uops on Intel before Broadwell. (So is rcl bx, 1; it's 3 uops even on later CPUs like Skylake according to https://uops.info/, although only 1 on AMD Zen.)
Due to partial-register considerations on P6-family, if I wanted to accumulate 16-bit or maybe 32-bit values, I might use cmc / adc bx,bx, or add al,255 / adc bx,bx, if I cared about those CPUs and not just 8086.
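For example, an untested sketch of the loop body using add al,255 / adc bx,bx to accumulate into the full BX with no partial-register writes (the surrounding setup is the same as the main example, including mov ax, 0100h to zero AL):

More:
add al, 255 ; CF = (AL != 0), i.e. CF = previous digit since AL was 0 or 1
adc bx, bx ; bx = bx*2 + digit, full-register read-modify-write
int 21h ; new char -> AL
sub al, '0'
cmp al, 1
jbe More ; loop while the char was '0' or '1'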
CMC was fast on the original 8086 (see timing tables for 8088 through 586, which don't consider code-fetch bottlenecks), and it's still efficient on modern x86: a single uop with 1 cycle latency. Using it instead of AL directly makes the latency chain from AL getting an ASCII digit longer by 1 cycle (sub/cmp/cmc/adc vs. sub/add), but that's not part of the loop-carried dependency chain through BX. In fact, on CPUs with efficient adc, that's better because adc bx,bx is 1 cycle of latency from the old BX to the new BX, vs. shl bl,1 / or bl,al being a chain of two operations.
Loading 2 bytes -> 2 bits from a string, or a 4-byte multiply bithack
Consider a case where we have a string in memory and know the array/string length ahead of time, so we don't have to check each digit separately for being out of range.
We could load 2 bytes at a time and shuffle the bits together. With binary digits in printing order in a string (most-significant first, at lowest address), and x86 being little-endian, a 2-byte load will have the bits in the opposite order of what we want. So we actually need to shift or rotate the low bit of the top half into the bottom half.
Sep suggested in comments ror ah, 1 / rol ax, 1 to first get the bit we want from the 2nd byte (AH) to the top of the register, then rotate it around next to the bit at the bottom of AL. That's not ideal on modern Intel CPUs (Sandybridge-family), where rotate-by-1 costs 2 uops (https://uops.info/) because of how it has to update FLAGS while leaving some unmodified, unlike rotate by immediate. And it will have partial-register stalls on P6-family (PPro through Nehalem).
Here's 8086-compatible code that runs OK on all CPUs, although it's not perfect for some of them. rol by 1 is single-uop on P6-family and adc is 2, but they have partial-register stalls of multiple cycles when reading a wider register after writing a narrower part of it. For P5 Pentium (in-order dual-issue), this could be scheduled to allow pairing of shr bh,1 with mov ax, [si+2] at least.
mov bx, [si]
;ror bh, 1 ;rol bx, 1 ; would be 4 uops on some CPUs
shr bh, 1 ; shift the bit we want into CF
adc bl, bl ; shift it into BL. 2 uops on some CPUs, but not terrible.
mov ax, [si+2] ; shift the next 2 bits into BL separately
shr ax, 1 ; first the low-address bit (also moving the last bit into AL)
adc bl, bl
add al, al ; then shift the highest bit out the top of AL
adc bl, bl
and bx, 0x0f ; P6 partial-register stall when reading BX after writing BL
; P6 would prefer and al, 0xf / movzx bx, al or something.
mov dl,[Hex+bx] ; lookup table, either ASCII digits or 7-seg codes
For throughput, and to avoid writing AH (not a big deal unless something later reads AX, but it could use another physical register on Sandybridge-family), I shifted the whole AX and used an add instruction instead of yet another shift. (adc and shr compete for ports 0 and 6 on Haswell.) The "simpler" version would be shr al, 1 / ... / shr ah, 1, which in theory has instruction-level parallelism, and does in practice on P6-family, which renames AL separately from the full EAX.
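That simpler version would look something like this (untested sketch, replacing the AX-handling part of the block above):

mov ax, [si+2] ; next 2 ASCII digits
shr al, 1 ; CF = low bit of the byte at [si+2] (higher place value)
adc bl, bl
shr ah, 1 ; CF = low bit of the byte at [si+3]; independent of the shr al,1 above
adc bl, bl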
Many alternatives are possible, especially with 386 instructions, for example preparing BX, then ror ah, 1 / ror ax, 1 / shld bx, ax, 2 to shift 2 bits from the top of AX into the bottom of BX. But SHLD is slow on modern AMD (Zen). I couldn't find a way to use rotates without multiple partial-register stalls on P6-family, so adc seemed the least bad. The best choice for any actual use-case would depend on which CPUs you care about. Instead of shr bh,1, bt bx, 8 would get the bit into CF without creating partial-register problems in EBX; it's fully efficient on Intel CPUs like Core 2, and 2 uops on AMD Zen.
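A sketch of that 386 SHLD alternative (untested):

mov ax, [si] ; two ASCII '0'/'1' chars: AL = lower address = higher place value
ror ah, 1 ; AH's low bit -> bit 15 of AX
ror ax, 1 ; now bit 15 = AL's bit, bit 14 = AH's bit
shld bx, ax, 2 ; shift those top 2 bits of AX into the bottom of BX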
If we wanted to use xlat, we wouldn't need to extend to 16 bits, maybe avoiding partial-register stalls on P6. Or with 386 instructions, get two separate 2-bit values, mask away the high garbage, and combine them with lea bx, [ebx + eax*4].
To get the other order, ror al, 1 / shr ax, 7 is possible, but has a partial-register stall on Intel P6-family. (386 and later have a barrel shifter for efficient shifts by more than 1.)
With a fast multiply ALU (like P6 and later), even 4 bits at once is possible
mov eax, [si]
and eax, 0x01010101 ; saves 1 byte vs. loading into EBX
imul ebx, eax, 0x08040201 ; top byte = (low byte * 8) + (byte#1 * 4) + ...
shr ebx, 24 ; result = low 4 bits of EBX
movzx edx, byte ptr [Hex+bx] ; lookup if desired
; modern CPUs prefer writing a full reg, not merging a new DL into EDX
See How to create a byte out of 8 bool values (and vice versa)? - recall that a multiply adds shifted copies of the input, and if those adds don't carry out across bitfields in your input, you can use it as a multi-shift-and-combine with OR or ADD. We're reversing the multiplier here, with 0x08 in the most-significant byte to put the bit from the low byte at the top of the high nibble, and so on. The partial products that end up in the top byte of the multiply are the ones whose place-values add up to 24: high x low and low x high bytes, and the two middle bytes.
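As a concrete breakdown of where each masked input bit lands (input-bit position plus the position of the set bit in the corresponding multiplier byte):

; input bit  0 (first char) * multiplier bit 27 (the 08h byte) -> product bit 27
; input bit  8              * multiplier bit 18 (the 04h byte) -> product bit 26
; input bit 16              * multiplier bit  9 (the 02h byte) -> product bit 25
; input bit 24 (last char)  * multiplier bit  0 (the 01h byte) -> product bit 24
; so shr ebx, 24 leaves the packed 4-bit value, with the first char as its high bit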
We can pack bits into an arbitrary order this way, with no need for bswap eax ahead of time or anything, just like we avoided ror ax, 8 2-byte swaps with the 16-bit strategy. For MMX or SSE2 SIMD (see below), we do want something like pshufb to byte-reverse, though.
Related:
- Converting bin to hex in assembly shows a counted loop using shr al,1 to set CF = low bit of AL, i.e. shift the low bit out of AL into CF, and rcl bx,1 to shift it into BX. It works for numbers more than 4 bits wide since its integer-to-hex function uses a loop over 4-bit groups, generating multiple hex digits if needed.
- How to convert a binary integer number to a hex string? (binary as in a register value; it doesn't mention binary strings. It has 32-bit scalar code, and SSE2/AVX/AVX-512 ways to turn 4-bit groups into hex digits in parallel.)
- Displaying numbers with DOS - for base 10 output of 16 or 32-bit integers, potentially multiple digits.
- Displaying Time in Assembly - for 2-digit base 10, like 00 to 99, so it'll work for inputs like 1111 (F in hex, 15 in decimal).
To do this efficiently for longer binary integers, use SSE2 SIMD: Does the x86 architecture support packing bools as bits to parallelize logic operations? / Extract the low bit of each bool byte in a __m128i? bool array to packed bitmap - load, then pslld xmm0, 7 to shift the low bit of each byte to the top of that byte, then SSE2 pmovmskb to pack a bit from every byte, 16 bytes at once. The low bit of '0' is 0, and the low bit of '1' is 1.
To check for invalid characters with SIMD, see the range-check part of Convert a String In C++ To Upper Case, but for '0'-'1' instead of 'a'-'z'. To handle digits past the first invalid character, probably pmovmskb eax, xmm0 on the range-check result, and blsmsk. Or, since VEX-coded instructions like BMI1 blsmsk aren't available in real mode, use the equivalent bithack of mask ^= mask-1 and AND with that to zero higher bits in the packed low bits (i.e. in the binary integer). (But that won't zero the bit corresponding to the first non-digit character itself, so I guess right-shift the mask by 1, or use a bithack that gives a mask up to but not including the lowest set bit.)
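A rough scalar sketch of that last bithack (untested; assumes EAX holds the pmovmskb result of the "is this byte invalid?" compare and EDX holds the packed digit bits, both still in memory order with the first character at bit 0, i.e. before any reversal):

mov ecx, eax
dec ecx ; mask-1: ones below the lowest set bit (all-ones if there was no invalid char)
not eax
and ecx, eax ; (mask-1) & ~mask = ones strictly below the first invalid byte's position
and edx, ecx ; keep only the digits before the first invalid character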
Also, pmovmskb will put the bit from the lowest-address digit into the lowest bit of the result. But that's the first digit, the one with the highest place-value, so we need to reverse the vector before pmovmskb, probably with SSSE3 pshufb. If you only had a 4-bit integer, mov eax, [si] / bswap eax / movd xmm0, eax would also work.
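Putting those pieces together, a minimal SSE2/SSSE3 sketch for 16 already-validated digits (untested; revshuf is an assumed 16-byte-aligned constant of 0Fh,0Eh,...,00h used to reverse the byte order):

movdqu xmm0, [si] ; 16 ASCII '0'/'1' bytes
movdqa xmm1, [revshuf] ; byte-reversal shuffle control
pshufb xmm0, xmm1 ; SSSE3: first (most significant) char -> highest byte
pslld xmm0, 7 ; low bit of each byte -> top bit of that byte
pmovmskb eax, xmm0 ; pack the top bit of each byte: AX = the 16-bit integer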
Once you have a 16-bit integer from pmovmskb (or a 32-bit integer from combining two results with shift/OR), you can use more SIMD code to make a hex string representing it, as shown in How to convert a binary integer number to a hex string?