First of all, your biggest problem is that the value you're looking for in b is NOT the same as a. x86 is little-endian, so a memcpy from a to b (or any other byte-at-a-time copy without byte-swapping) would actually produce:
a db 12h, 34h, 56h, 78h, 9Ah, 0,0,0 ; added padding
b dd 78563412h, 0000009Ah
Your b dd 12345678h, 9A000000h has the first dword endian-swapped, and the 5th byte of a as the MSB of the 2nd dword in b, not the LSB.
Copying 5 bytes from a to b leaves the last 3 bytes of b uninitialized. (In Unix, .bss space is zero-initialized. I assume this happens for dup(?) space in MASM/TASM, but if not, whatever garbage was there before will still be there.)
If you copy 8 bytes from a to b, the three bytes after the 9A will be read from the start of b if they end up in the same section (rather than b going into bss). Perhaps this is why you used an org directive to separate them in your answer.
If you don't have any special reason to want to copy a dword all at once, then in 8086 code you should just use rep movsw, or normal mov instructions, like
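mov ax, word ptr [a]     ; If your addresses are static, might as well just use
mov dx, word ptr [a+2]   ; absolute addressing, esp in 16bit code where it's only 2B
mov word ptr [b], ax     ; (word ptr: MASM checks operand size against the db/dd declarations)
mov word ptr [b+2], dx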
Note that your loop with si and di only increments them by 1, but you load/store two bytes. Unaligned overlapping loads/stores work, but you're doing redundant work.
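As a rough sketch of the fix, a word-at-a-time loop should advance both pointers by 2 per iteration. This assumes a and b are the labels from the question, that DS addresses both of them, and that 5 bytes are wanted in total; the copy_words label is just a name picked for illustration.

```asm
        mov  si, offset a       ; DS:SI -> source bytes
        mov  di, offset b       ; DS:DI -> destination
        mov  cx, 2              ; 2 words = the first 4 bytes
copy_words:
        mov  ax, [si]           ; load one word
        mov  [di], ax           ; store one word
        add  si, 2              ; advance by the size of what was copied
        add  di, 2
        loop copy_words         ; dec cx, repeat while cx != 0
        mov  al, [si]           ; the 5th byte (9Ah) is copied on its own
        mov  [di], al
```

For a copy this small, the unrolled mov version is probably simpler; the loop form only pays off with a larger or variable count.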
For your case, you have 5 bytes to copy. You could use rep movsb with cx=5. 8086 of course doesn't support movsd or movsq, and rep startup overhead makes it inefficient for small copies.
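A minimal sketch of the rep movsb version, assuming ES already holds the same segment as DS (movsb always writes through ES:DI) and that a and b are the labels from the question:

```asm
        cld                     ; direction flag = 0, so SI/DI count upward
        mov  si, offset a       ; DS:SI -> source
        mov  di, offset b       ; ES:DI -> destination (assumes ES = DS)
        mov  cx, 5              ; number of bytes to copy
        rep  movsb              ; copy CX bytes, incrementing SI and DI each time
```

If ES is not set up yet, something like push ds / pop es before the copy would be needed.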
If you do care about doing both loads at once, e.g. from a dword that an interrupt handler can modify:
On a single-core CPU, we don't have to worry about memory being modified by other concurrent threads. However, an interrupt (maybe triggering a context-switch to another thread) could arrive between any two instructions, but not in the middle of a single instruction. (This is the big difference between single-core and multi-core atomicity: on a multi-core system, another core can access memory between the separate accesses that make up a single instruction.)
So, if you're loading a dword that can be modified asynchronously (e.g. by an interrupt handler), and you want to load both halves of it at once, you need to get both halves with a single instruction.
Do not use this if you're just writing normal single-threaded programs without interrupt handlers.
One way is with Sep Roland's les trick (see his answer), but that leaves ES temporarily set to something weird, which might be a problem depending on your interrupt handler.
Another way uses the x87 FPU (not guaranteed to exist on an 8086 system), which can copy in 32-bit or 64-bit chunks, e.g.
fild dword ptr [a] ; load 32bits as an integer
fistp dword ptr [b] ; store as the same integer
; also works with qword ptr
; or store to the stack and then load into dx:ax with two mov instructions
; your own stack memory is private, so you don't need atomic ops there
x87's internal 80-bit FP format has a 64-bit significand, so it can exactly represent every 64-bit integer, and this works on any possible bit-pattern. (fld/fstp wouldn't, because fld interprets the bytes as an IEEE double-precision floating point value and doesn't round-trip every bit-pattern, e.g. signaling NaNs get quieted, unlike fild.)
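The stack variant mentioned in the comments above could look roughly like this. It is only a sketch: it assumes an x87 is present, that a is the dword from the question, and that using bp as a temporary frame pointer is acceptable at this point in your code.

```asm
        push bp
        mov  bp, sp
        sub  sp, 4                  ; reserve 4 bytes of private stack space
        fild  dword ptr [a]         ; one instruction reads all 32 bits of a
        fistp dword ptr [bp-4]      ; store the same integer to the stack (SS:[bp-4])
        fwait                       ; 8087 runs asynchronously: wait for the store to finish
        mov  ax, [bp-4]             ; low word
        mov  dx, [bp-2]             ; high word, so dx:ax = the value of a
        mov  sp, bp                 ; release the scratch space
        pop  bp
```

Only the fild has to be a single access; the two mov loads read memory that nothing else modifies. Reserving the space with sub sp before the store keeps an interrupt handler's own pushes from overwriting it.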
Even on 8086, it will be atomic with respect to interrupts. On 486 and later hardware, fild dword is also atomic with respect to other bus agents, as long as the load is aligned.
gcc actually uses this to implement C++11 std::atomic<uint64_t> loads/stores in 32-bit mode (since the ISA guarantees that naturally-aligned loads/stores of 64-bit and smaller values are atomic, on P5 and later).
gcc used to bounce std::atomic<double> values around with fild/fstp when SSE2 wasn't available, but that was fixed after I reported it. (I noticed the issue while answering "Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs".)
See Agner Fog's Optimizing Assembly guide for other useful tricks. (And also the x86 tag wiki).