You can swap two bytes in memory with a 16-bit rotate by 8:
rol word ptr [ebp], 8 ; byte [ebp] becomes byte [ebp+1], and vice versa
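If it helps to see why a rotate by 8 is a byte swap, here is a minimal C model of that instruction. It's only a sketch: the helper names rol16_by_8 and swap_adjacent_bytes are made up for illustration, and the 16-bit value is assembled by hand in little-endian order (as x86 would load it) so the result doesn't depend on the host machine.

```c
#include <stdint.h>

/* Rotate a 16-bit value left by 8: the two bytes trade places. */
static uint16_t rol16_by_8(uint16_t v) {
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Hypothetical helper modelling `rol word ptr [p], 8` on little-endian x86. */
static void swap_adjacent_bytes(uint8_t *p) {
    uint16_t w = (uint16_t)(p[0] | (p[1] << 8)); /* 16-bit little-endian load */
    w = rol16_by_8(w);
    p[0] = (uint8_t)(w & 0xFF);                  /* 16-bit little-endian store */
    p[1] = (uint8_t)(w >> 8);
}
```

Calling swap_adjacent_bytes on a buffer holding 0x12, 0x34 leaves it holding 0x34, 0x12.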
But if you already have the bytes in registers (e.g. because you loaded them so you could compare them), then it might be better to store them from registers.
Since you need to store only the low byte of the register, you need to use byte stores of al or bl, not dword stores of edi! Change your register allocation so you have your bytes in one of AL, BL, CL, or DL, not in the low byte of EDI. Only x86-64 makes the low byte of EDI accessible (as DIL). Use EDI as your index. (The name EDI comes from Destination Index.) Or instead of base+index, use pointer increments so EDI is pointing at the current byte or pair of bytes you might want to swap.
Thus:
movzx eax, byte ptr [edi + ebp] ; load the 1st byte
movzx edx, byte ptr [edi + ebp + 1] ; load the 2nd byte
cmp al, dl
jae noswap
mov [edi+ebp], dl ; opposite of how you loaded them
mov [edi+ebp+1], al
noswap:
inc edi
... loop logic
movzx avoids a false dependency on the old value of EAX on CPUs that don't rename AL separately from the whole EAX. If you'd done mov al, [edi + ebp], some CPUs would have the old value of EAX as another input dependency for that instruction.
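One way to picture that dependency is to write each load as a C function of its inputs. This only models the architectural result, not any particular CPU's renaming, and the names mov_al and movzx_eax are invented for the sketch.

```c
#include <stdint.h>

/* Models `mov al, [mem]`: only the low 8 bits are replaced, so the result
   is a function of the OLD register value as well as the loaded byte. */
static uint32_t mov_al(uint32_t old_eax, uint8_t loaded) {
    return (old_eax & 0xFFFFFF00u) | loaded;
}

/* Models `movzx eax, byte ptr [mem]`: the whole register is overwritten,
   so the old value is not an input at all. */
static uint32_t movzx_eax(uint8_t loaded) {
    return loaded;
}
```

The extra old_eax parameter is the "other input" that a CPU without separate AL renaming has to wait for.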
Note that if you're actually implementing Bubble Sort (eww, yuck), you only need to do one load per iteration. You always have one of the two values to compare in a register already. You can set up for the first iteration with a load outside the loop.
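Here's a rough C sketch of that one-load-per-iteration structure, assuming an ascending sort with unsigned compares (the sort direction and the names bubble_pass and bubble_sort_bytes are my assumptions, not something from the question). The variable prev stands in for the register that already holds one of the two values.

```c
#include <stddef.h>
#include <stdint.h>

/* One bubble-sort pass over n bytes, ascending, unsigned compare.
   Each iteration loads only buf[i + 1]; the other value is carried in prev.
   Returns nonzero if any swap happened. */
static int bubble_pass(uint8_t *buf, size_t n) {
    int swapped = 0;
    if (n < 2) return 0;
    uint8_t prev = buf[0];            /* the load hoisted out of the loop */
    for (size_t i = 0; i + 1 < n; i++) {
        uint8_t next = buf[i + 1];    /* the only load in the loop body */
        if (next < prev) {
            buf[i] = next;            /* store in swapped order */
            buf[i + 1] = prev;        /* prev is now at i+1, still carried */
            swapped = 1;
        } else {
            prev = next;              /* no swap: next becomes the carried value */
        }
    }
    return swapped;
}

/* Repeat passes until one completes without swapping. */
static void bubble_sort_bytes(uint8_t *buf, size_t n) {
    while (bubble_pass(buf, n)) { }
}
```

In asm, prev and next would just be two byte registers (say AL and DL) whose roles you track across the swap/noswap paths.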
If you were doing 16-bit loads and then comparing only the low byte (e.g. as part of a sort), you're perfectly set up to swap the low 2 bytes of eax and store that back:
movzx eax, word ptr [edi+ebp]
cmp ah, al ; compare the low 2 bytes of EAX with each other
jae noswap
rol ax, 8 ; swap AL with AH. This is more efficient than xchg al,ah or two MOV stores.
mov [edi+ebp], ax
noswap:
This is good on its own, but the 16-bit store which partially overlaps with the next 16-bit load is kinda bad (store-forwarding stall). Loading just the new byte into AH (keeping the old byte in AL) isn't great either; that will stall for a cycle to merge when reading AH on modern Intel CPUs, and create a dependency chain.
And why did I use [edi+ebp] instead of [ebp+edi]? It saves a byte to make EBP the index register, because [EBP + EDI*1] with no displacement isn't encodable. NASM and YASM don't swap for you, because base=EBP implies the SS segment. But assuming a flat memory model, it doesn't matter, so we can manually make that optimization.
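To make the size difference concrete, here's a small C sketch that builds the machine code for a byte store like mov [base+index], dl (opcode 88 /r with a SIB byte). The function encode_mov_store8 is a hypothetical helper written for this answer, not part of any assembler, and it assumes 32-bit addressing, scale 1, and an index other than ESP.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit register numbers as used in the ModRM/SIB fields. */
enum { EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI };

/* Encode `mov [base + index*1], reg8`. reg8 uses the 8-bit numbering
   AL=0, CL=1, DL=2, BL=3. Assumes index != ESP (ESP can't be an index).
   Returns the number of bytes written to out. */
static size_t encode_mov_store8(int base, int index, int reg8, uint8_t out[4]) {
    size_t n = 0;
    /* mod=00 with SIB base=101 means "disp32, no base register", so using
       EBP as the base needs mod=01 plus an explicit zero disp8. */
    int need_disp8 = (base == EBP);
    out[n++] = 0x88;                                              /* MOV r/m8, r8 */
    out[n++] = (uint8_t)(((need_disp8 ? 1 : 0) << 6) | (reg8 << 3) | 4); /* rm=100: SIB follows */
    out[n++] = (uint8_t)((0 << 6) | (index << 3) | base);         /* scale field 0 = *1 */
    if (need_disp8) out[n++] = 0x00;
    return n;
}
```

With reg8 = 2 (DL), base=EDI and index=EBP gives the 3 bytes 88 14 2F, while base=EBP and index=EDI gives the 4 bytes 88 54 3D 00.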
Related:
A bubblesort on 32-bit int elements: Bubble sort in x86 (masm32), the sort I wrote doesn't work
A BubbleSort of 8-bit elements, in 19 bytes of 16-bit x86 machine code: https://codegolf.stackexchange.com/questions/77836/sort-an-integer-list/149038#149038, and a JumpDown sort of 32-bit elements.
Assembly - bubble sort for sorting string (sorting the chars in a string)
Assembly bubble sort swap
Changing between 8-bit vs. 32-bit elements is just a matter of changing register names and changing the increment from 1 to 4, once you understand how x86 partial registers work.