-3

given a number in AX, store the corresponding bit string in str1. if AX = 0x1234, the result should be: str1 = 0001001000111

How can I convert everything in AX to binary Do I have to use loop? How to implement this method?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Which architecture are you programming for? X86 or x86_64 (or something else?) The usual approach is to check each bit in order. You can do this with the `bt` instruction. Then, add a 0 or 1 to the string depending on what the bit is. – fuz Jun 28 '22 at 13:45
  • @DavidC.Rankin But 1234 in binary is what OP says. Confusing... – fuz Jun 28 '22 at 13:46
  • OOOOhhhh. Thank you -- went right over my head. Yes `0x1234` (decimal `4660`) was what I thought. – David C. Rankin Jun 28 '22 at 13:47
  • The programming environment I am currently using is amd64, I don't know how to get the value and convert it to binary – 0xbe61a55f Jun 28 '22 at 13:48
  • 1
    @0xbe61a55f The value is the number itself. It is already in binary. You just have to generate a string from the bits. – fuz Jun 28 '22 at 13:49
  • I don't quite understand the meaning Suppose the topic RAX=0x30e5 I have to figure out the answer b'0011000011100101\x00\x00\x00\x00' How can this be achieved – 0xbe61a55f Jun 28 '22 at 13:57
  • 1
    Your example in the question lists "0001001000111" but that would be "0_0010_0100_0111" which is 0247h. If you want to convert 1234h to a binary-digits string that would be "0001_0010_0011_0100" (or "0001001000110100"). Please fix your example. – ecm Jun 28 '22 at 14:39
  • 1
    Use math to manipulate digits. On a processor there are efficient ways to do /2 (divide by 2) and %2 (mod by 2), but it is still just math. For an example of manipulating digits in base 10 (decimal), if you have the number 89, and you want the last digit, that's 89%10=9. And if you have 89 and want the first digit, that's 89/10=8. Do you need a loop for that? No, it can be done without a loop, but a loop is an obvious choice rather than repeating the same code 16 times (once for each digit). – Erik Eidt Jun 28 '22 at 16:43
  • Near duplicate of [Convert 16 bits mask to 16 bytes mask](https://stackoverflow.com/a/67203617) - no you don't have to loop, use SSE2 to convert 16 bits to 16 bytes. (But that Q&A gets then in LSB-first order, not MSB-first printing order.) – Peter Cordes Jun 29 '22 at 02:33

1 Answers1

1

Input number in ax, and output binary string in [rdi]. If ax = 0x1234, [rdi] = "0001001000110100" with a terminating null character.

Using NASM targeting x86_64.

SSE2 optimized version

Based on Peter Cordes' post, with hand-tuned scheduling.

movd xmm2, eax
mov rdx, 0x0102040810204080
movq xmm1, rdx
mov edx, 0x30303030
movd xmm0, edx
punpcklbw xmm2, xmm2
punpcklwd xmm2, xmm2
punpckldq xmm2, xmm2
pshufd xmm2, xmm2, 0x4e; swap high64/low64
punpcklqdq xmm1, xmm1; bit-select mask
pand xmm2, xmm1
pcmpeqb xmm1, xmm2
pshufd xmm0, xmm0, 0; "0000..."
psubb xmm0, xmm1
movdqu [rdi], xmm0
mov byte [rdi + 16], 0

Old version using an unrolled loop

Each iteration has no dependency on the previous iteration, and the loop is fully unrolled.

If you are not familiar with NASM macros. Read the docs (https://www.nasm.us/xdoc/2.10rc8/html/nasmdoc4.html).

%assign i 0
%rep 16
  xor edx, edx
  bt eax, 15 - i
  adc edx, 48
  mov [rdi + i], dl
  %assign i i + 1
%endrep
mov byte [rdi + 16], 0
xiver77
  • 2,162
  • 1
  • 2
  • 12
  • 1
    Since edx is already zero, you can replace `setc dl ; add edx, 48` with `adc dl, 48`. – Nate Eldredge Jun 28 '22 at 18:28
  • 2
    You wouldn't normally want to fully unroll this with a `%rep` macro. Also, if you have x86-64, you should just use SSE2 especially if you favour speed over small code-size and don't mind leading zeros. See [Convert 16 bits mask to 16 bytes mask](https://stackoverflow.com/a/67203617) (includes compiler-generated asm) – Peter Cordes Jun 29 '22 at 01:48
  • 1
    Also, why `bt` instead of a more standard `shl ax, 1` or `add ax, ax` to shift a bit out the top (into CF)? Is that an attempt to gain some ILP (Instruction level parallelism)? You could split it into two dep chains if you wanted, with `movzx ecx, ah`, then `add cl,cl` / `adc` / `mov [rdi + i + 8], dl` and `add al,al` / `adc` / `mov [rdi + i + 0], dl`. – Peter Cordes Jun 29 '22 at 02:27
  • 1
    [SHL instruction with AND instruction to isolate each bit and jump on carry](https://stackoverflow.com/q/28640741) is a typical example of an int->binary loop, using `mov edx, '0'` / `adc dl, 0` which takes advantage of [Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?](https://stackoverflow.com/q/51664369) for SnB / IvB / Haswell. (But xor-zero / `adc dl, '0'` is better on other CPUs, being shorter.) – Peter Cordes Jun 29 '22 at 02:52
  • From my linked post: *If you want a string of ASCII `'0'` / `'1'` characters, do `pcmpeq` and `_mm_sub_epi8(_mm_set1_epi8('0'), v)`. That avoids the `set1(1)` vector constant.* Also, if you're going to movq+shuffle for the broadcast constants, you might can save code size with `mov edx, '0000'` / `movd xmm1, edx` / `pshufd xmm1, xmm1, 0`. There's no need for a 10-byte `mov r64, imm64` for that, unless you wanted to pack 2 separate constants into it for one movq to pshufd 2 ways. But you don't need the set1(1) constant, you only have the one simple constant (which you could load from mem) – Peter Cordes Jun 29 '22 at 09:22
  • (See the AVX2 version at the bottom of my linked answer; unfortunately the question I was answering asked for something weird, instead of int->base 2 serialization, but turned out to actually be wanting ASCII digits after that, so it was an XY problem.) Also, don't forget to comment your asm. The broadcast of the constants from 8 to 16 bytes were non-obvious, mixed in with their use making the interesting part of the code more cluttered. – Peter Cordes Jun 29 '22 at 09:33
  • For printing order, note that you also want `0x0102040810204080` as a mask constant, to have the high bit (`0x80`) at the lowest address (the least significant since x86 is little-endian). And after duplicating the input pair of bytes up to a pair of dwords, you should `pshufd` so you can reverse it: high dword broadcast to the low 2 dwords, low dword broadcast to the high 2 dwords. – Peter Cordes Jun 29 '22 at 09:53
  • @PeterCordes I hope the current version has no problems. Yes, I used `bt` for better parallelism. Also, `bt r32, i8` didn't have less throughput than `shl r16, 1` when I checked [uops.info]. I hadn't thought about `add ax, ax` when I posted this, but yes, it has slightly better throughput than `bt` (`0.2~0.33 vs 0.33~0.5`). Probably the dependency when using `add` is hidden behind the memory stores, but it's quite satisfactory to look at a loop without dependencies. – xiver77 Jun 29 '22 at 10:19
  • Yeah, actually `bt` is probably fine, it just takes an extra byte of machine code. Or 2 bytes for the immediate version. Note that Alder Lake-P only runs `bt` on one port (p1), while integer shifts are still on ports p06. And of course `add` on any port. AMD has slower BTS/BTR/BTC, but read-only BT reg,reg is single-uop any port on Zen. Pretty much a moot point for 64-bit mode since SIMD makes it redundant except for small bitfields. – Peter Cordes Jun 29 '22 at 10:25
  • @PeterCordes In this uiCA trace table for Skylake (https://uica.uops.info/tmp/58c0a62b777e459fb8be2a890e234214_trace.html), why is the first `mov r64` bound to port 5 making the next `movq` to stall (is this the correct term?). – xiver77 Jun 29 '22 at 10:29
  • 1
    Seems unlikely that it would actually schedule that way. In real life, when that group of instructions hits the alloc/rename/issue stage, there will be some uops in the RS already queued for various ALU ports, unless it was just after a front-end stall that gave the back end time to drain, or the previous instructions were all load/store or for different ports. I'd have hoped that if the queues for port 5 and others were equal length, and an issue group that included a port-5-only uop was being allocated, all the other uops would pick different ports if possible. That happens on later iters – Peter Cordes Jun 29 '22 at 10:35
  • Also note that your trace is considering a loop that redoes the constant setup. If you can't hoist that, but you're still calling it often, put the constant in memory. For either an SSE3 `movddup` broadcast load, or a plain `movdqa` 16-byte load. (You could maybe re-arrange to use `paddb` with `set1('1')` so it could be a memory source operand, `v + c1` instead of `c0 - v`. But only if you used a `pandn` earlier instead of `pand`, to flip 0s and 1s. I wouldn't recommend it for an answer to a beginner question; that's unnecessarily complicated and harder to understand.) – Peter Cordes Jun 29 '22 at 10:41