I won't repeat @fuz's answer, but I want to add:
If you had just let the assembler do its job by writing `add word [myvar], 0xA5`, it would have picked the smallest encoding that worked. If your immediate had fit in a sign-extended imm8, it would have used the `add r/m16, imm8` encoding. There is usually no need to use size-overrides on non-memory operands. All the major x86 assemblers optimize the size of immediate operands. Some (e.g. NASM) will even optimize `mov rax, 1` into the equivalent but shorter `mov eax, 1`, and stuff like that, but others (YASM) won't.
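As a rough sketch of what that size selection looks like, here are three forms of the same instruction with the bytes an optimizing assembler would be expected to emit. Using `rdi` as a pointer to the 16-bit variable is my assumption for illustration (it keeps the addressing-mode bytes short); it is not from the question.

```nasm
bits 64
; Hypothetical setup: RDI is assumed to hold the address of a 16-bit variable.
add word [rdi], 1       ; 66 83 07 01      imm8 form: 1 fits in a sign-extended byte
add word [rdi], 0xA5    ; 66 81 07 A5 00   imm16 form: as an imm8, 0xA5 would sign-extend to 0xFFA5
add word [rdi], -91     ; 66 83 07 A5      imm8 form: -91 is 0xFFA5 as a 16-bit value
```

The imm8 forms are one byte shorter, and no size override on the immediate was needed to get them.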
You can force the assembler to use wider immediates than necessary for padding/alignment, though. For example, `add word [myvar], strict word 1` would use the imm16 version. (Without `strict`, a size keyword on the immediate doesn't stop the assembler from optimizing it to a smaller encoding.) You can also write `add word [rcx + strict dword 0], strict word 1` to force a `[base + disp32]` encoding for the addressing mode.
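A minimal sketch of the effect of `strict` on the immediate, again assuming (hypothetically) that `rdi` points at the variable. It only covers the immediate, not the `strict dword` displacement form.

```nasm
bits 64
; Hypothetical setup: RDI is assumed to hold the address of a 16-bit variable.
add word [rdi], 1               ; 66 83 07 01      assembler is free to pick imm8
add word [rdi], strict word 1   ; 66 81 07 01 00   imm16 forced: same operation, one byte longer
```

Both lines do the same thing at run time; the second just occupies an extra byte, which is the point when you are padding.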
When possible, avoid 16-bit immediate operands to instructions other than `mov`. On many Intel CPUs, such an instruction is slow to decode because of an LCP (length-changing prefix) stall. This might not be a problem on newer CPUs that have a decoded-uop cache. But on older Intel CPUs, this will probably run faster, at the cost of a scratch register:
    movzx eax, word [myvar]
    add   eax, 0xA5          ; add ax, 0xa5 is 1B smaller, but has the same LCP stall.
    mov   [myvar], ax
With `add`/`sub`, carry only propagates from low bits to high bits, so the low part of a wider add is always the same as what you'd get from a narrow add. Avoiding LCP stalls for register operands is usually cheap (just 1 extra byte for `add eax, imm32`, since it doesn't need an operand-size prefix), but the load and store are extra instructions.
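To make the carry argument concrete, here is a small worked case. The starting value `0x1234FFFF` is an arbitrary one I picked so that the low half overflows; it is not from the question.

```nasm
; Hypothetical starting value, chosen so the low 16 bits overflow.
mov eax, 0x1234FFFF     ; AX = 0xFFFF, upper half = 0x1234
add eax, 0xA5           ; EAX = 0x123500A4: the carry out of bit 15 lands in the upper half
                        ; AX  = 0x00A4, the same low 16 bits that add ax, 0xA5 would produce
```

The upper half ends up different from what a 16-bit add would have left there, but only `ax` gets stored back to the variable, so that doesn't matter.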
The three-instruction version takes a lot more code size, so it's probably slower on CPUs that don't have LCP stalls. It's only 1 more uop for the front-end on Intel Sandybridge-family (which can micro-fuse the load+add in the one-instruction version), and the same number of uops for the execution units / scheduler. (Memory-destination instructions decode to load, ALU, and store uops.)