You don't need the div instruction to divide by 16: just shift right by 4 bits to get the quotient (the high hex digit), or AND with 0x0f to get the remainder (the low hex digit). Hex is special because the base is a power of 2. (Unlike splitting numbers into decimal digits, where you do need division by 10.)
Never use div where the divisor is known to be a power of 2, especially not when it's an assemble-time constant.
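As a minimal sketch of that, assuming the byte to split is already in AL and the target is 186 or later (so immediate shift counts are allowed), the two hex digits can be separated like this. The choice of AH as the second register is just for illustration.

```asm
; Split the byte in AL into its two hex digits without div.
mov  ah, al        ; copy the byte so both halves can be extracted
shr  ah, 4         ; AH = high nibble = value / 16
and  al, 0x0f      ; AL = low nibble  = value % 16
```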
The "clever" way to do this would be to do a word load to get both values, then rotate that word so the 2 (low nibble of A) and 3 (high nibble of B) are in the same byte. Swap them with a 4-bit rotate of that byte, then rotate the word in the opposite direction from the first rotate to put the nibbles back where they were, with the swap still done.
; A and B are known to be adjacent, so a 2-byte load from A gets B:A in AH:AL = AX
mov ax, word ptr [A] ; Z Y X W in MSB-first notation, AH then AL
;; In your case: 3 4 1 2 (Z and W are the 4-bit chunks we want to swap)
rol ax, 4 ; Y X W Z (rotate the extreme nibbles to share one byte)
rol al, 4 ; Y X Z W (swap halves of AL)
ror ax, 4 ; W Y X Z (undo the first rotate)
;; In your case: 2 4 1 3
mov word ptr [A], ax ; and store back to memory
If we'd started with ror ax, 4 (or rol ax, 12), we could have used rol ah, 4 because the pair we want to swap would be together in AH instead of AL. Swapping 4-bit halves of a register can be done with ROR or ROL, it doesn't matter.
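That alternative might look like the following sketch, using the same Z Y X W notation and the same assumption that A and B are adjacent bytes:

```asm
mov ax, word ptr [A]     ; Z Y X W   in MSB-first notation, AH then AL
ror ax, 4                ; W Z Y X   (W and Z now share AH)
rol ah, 4                ; Z W Y X   (swap halves of AH)
rol ax, 4                ; W Y X Z   (undo the first rotate)
mov word ptr [A], ax     ; same result as the AL version: 2 4 1 3
```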
If you're programming for original 8086 (without 186 features like immediate shift counts), put 4 in CL and use ror ax, cl or whatever.
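A sketch of the 8086-compatible version, assuming CL is free to clobber. The rotates don't modify CL, so it only needs loading once:

```asm
mov cl, 4                ; 8086 only allows a count of 1 or CL
mov ax, word ptr [A]     ; Z Y X W
rol ax, cl               ; Y X W Z
rol al, cl               ; Y X Z W
ror ax, cl               ; W Y X Z
mov word ptr [A], ax
```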
We could have done this with 3 memory-destination rotates, but that would suck for efficiency compared to using a register.
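For comparison only, the memory-destination version would be something like the sketch below (again assuming 186+ for the immediate counts). The low byte of the word is the one at A, which is why the byte rotate targets [A]. Each instruction is a separate read-modify-write of memory.

```asm
rol word ptr [A], 4      ; Y X W Z
rol byte ptr [A], 4      ; Y X Z W   (low byte of the word lives at A)
ror word ptr [A], 4      ; W Y X Z
```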
If we'd needed to swap Y and W for example, we'd need an extra rotate step because no word rotate can bring them together into the same byte. In that case a different approach might be better, like a bithack using xor to swap only the low 4 bits between AH and AL (kind of like an xor-swap, but masking the tmp values), before a word store of AX.
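One way that masked xor-swap could be written is sketched below. In the Z Y X W notation, Y and W are the low nibbles of AH and AL, so the mask is 0x0f; a mask of 0xf0 would swap the high nibbles (Z and X) instead. Using DL as the temporary is an arbitrary choice for this example.

```asm
mov ax, word ptr [A]     ; Z Y X W
mov dl, al
xor dl, ah               ; DL = AH ^ AL (bit differences between the two bytes)
and dl, 0x0f             ; keep only the low-nibble differences: 0 (Y^W)
xor al, dl               ; AL = X Y   (low nibble flipped from W to Y)
xor ah, dl               ; AH = Z W   (low nibble flipped from Y to W)
mov word ptr [A], ax     ; Z W X Y
```

With the example values 3 4 1 2, this stores 3 2 1 4.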
The "non-clever" way would be more complicated, basically doing bit-field extract / insert with SHR and/or AND (e.g. and al, 0x0f) to extract, and OR to insert (after clearing the destination bits with AND).
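A rough sketch of that extract / insert approach for the original swap (low nibble of A with high nibble of B), assuming 186+ immediate shift counts and that DL and DH are free as temporaries. It doesn't rely on A and B being adjacent.

```asm
mov al, byte ptr [A]     ; AL = X W
mov ah, byte ptr [B]     ; AH = Z Y
mov dl, al
shl dl, 4                ; DL = W 0   (low nibble of A, moved to the high position)
mov dh, ah
shr dh, 4                ; DH = 0 Z   (high nibble of B, moved to the low position)
and al, 0xf0             ; AL = X 0   (clear the destination bits in A)
or  al, dh               ; AL = X Z   (insert)
and ah, 0x0f             ; AH = 0 Y   (clear the destination bits in B)
or  ah, dl               ; AH = W Y   (insert)
mov byte ptr [A], al
mov byte ptr [B], ah
```

With A = 1 2 and B = 3 4, this leaves A = 1 3 and B = 2 4, the same result as the rotate version, at roughly twice the instruction count.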
If you understand rotates and x86 endianness, the "clever" way is short enough that it's easy to understand and verify. But if not, SHR, SHL, AND, and OR are going to be your building blocks.