"cqo", "cdq" and "cwd" x86_64 instructions. Why not use just cqo?

Question

I'm not the most experienced assembly programmer, and I ran into the "cqo", "cdq" and "cwd" instructions, which are all valid x86_64 assembly.

I was wondering whether there is any advantage to using cdq or cwd when operating on smaller values. Is there some difference in performance?

EDIT: I originally started looking into this when calculating the absolute value of one-digit numbers.

For example, if we have the value -9 in al:

cwd
xor al,dl
sub al,dl

vs. having it as a 32-bit value and calculating:

cdq
xor eax,edx
sub eax,edx

or, if we have a 64-bit value for -9:

cqo
xor rax,rdx
sub rax,rdx

If the original value is 64 bits wide and holds a value from -9 to 9, all three versions effectively seem to do the same thing.

Hi, welcome to Stack Overflow. Please provide some usage examples, and specific use cases so we can assist. Try to provide as much information so we can understand the situation. :) — Selfish, Nov 19 '15 at 19:45
Okay, I added some examples. Also, I heard that on a 32-bit machine it's faster to use 32-bit values instead of, say, bytes. Is this true for 64-bit values in the case of x86_64, or is it true at all? — Husky, Nov 19 '15 at 20:08
That looks great! Now that it's clear, I have voted up your question so hopefully it gets more attention. — Selfish, Nov 19 '15 at 20:10
`cwd` is slow on modern micro-architectures, since it only modifies the lower part of a register, so the result depends on the old value of `edx`. In contrast, neither `cqo` nor `cdq` have a dependency on the old value of `[r/e]dx`. — EOF, Nov 19 '15 at 23:02
They are the same, they all have the same instruction opcode, 0x99. The effect depends on what architecture you target, 16 vs 32 vs 64 bits. Giving them different names just helps writing understandable code. — Hans Passant, Nov 20 '15 at 02:59

score 4 · Accepted Answer · edited May 23 '17 at 11:53

You only have a choice if your value is already sign-extended to fill more than 16 bits of rax.

If you have a signed 16bit int in ax, but the upper16 of eax is unknown or zero, you must keep using 16bit instructions. cdq would set edx based on the garbage bit at the top of eax, not the sign-bit of your value in ax.

Similarly, if you were using 32bit ops to generate a signed 32bit int in eax, the upper32 will be zeroed, not sign-extended.
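
The two cases above can be checked with a small C model. This is only a sketch: `cwd_dx`, `cdq_edx`, `cqo_rdx` and `check_sign_sources` are hypothetical helper names, not intrinsics, and each helper just returns the value the instruction would place in dx, edx or rdx for a given rax.

```c
#include <assert.h>
#include <stdint.h>

/* Value each instruction would leave in dx / edx / rdx for a given rax.
   These helpers are illustrative models, not real intrinsics. */
uint16_t cwd_dx(uint64_t rax)  { return (rax & 0x8000u) ? 0xFFFFu : 0u; }
uint32_t cdq_edx(uint64_t rax) { return (rax & 0x80000000u) ? 0xFFFFFFFFu : 0u; }
uint64_t cqo_rdx(uint64_t rax) { return (rax >> 63) ? UINT64_MAX : 0u; }

/* Walk through the cases described above. */
void check_sign_sources(void)
{
    /* -9 as a 16-bit value in ax; the upper bits of rax happen to be zero. */
    uint64_t ax_only = 0x000000000000FFF7ULL;
    assert(cwd_dx(ax_only) == 0xFFFFu);        /* right: looks at bit 15 */
    assert(cdq_edx(ax_only) == 0u);            /* wrong: looks at bit 31 */

    /* -9 produced by a 32-bit op: eax = 0xFFFFFFF7, upper32 zeroed. */
    uint64_t eax_only = 0x00000000FFFFFFF7ULL;
    assert(cdq_edx(eax_only) == 0xFFFFFFFFu);  /* right: looks at bit 31 */
    assert(cqo_rdx(eax_only) == 0u);           /* wrong: looks at bit 63 */

    /* -9 sign-extended to all 64 bits: every variant sees a negative value. */
    uint64_t full = 0xFFFFFFFFFFFFFFF7ULL;
    assert(cwd_dx(full) == 0xFFFFu);
    assert(cdq_edx(full) == 0xFFFFFFFFu);
    assert(cqo_rdx(full) == UINT64_MAX);
}
```

Only in the last case, where the value is already sign-extended through bit 63, is there a free choice among the three instructions.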

If you can, use cdq. You might need cqo if you need all 64bits set in rdx.
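
For reference, here is a minimal C sketch of the absolute-value idiom from the question at 32-bit operand size. The function names `abs_via_sign_mask` and `check_abs` are made up for illustration, and the arithmetic is done on unsigned values so the wraparound is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Branchless absolute value, mirroring: cdq / xor eax,edx / sub eax,edx. */
uint32_t abs_via_sign_mask(int32_t x)
{
    uint32_t eax = (uint32_t)x;
    uint32_t edx = (x < 0) ? 0xFFFFFFFFu : 0u;  /* cdq: all ones if negative */
    eax ^= edx;                                 /* xor eax, edx: flip bits if negative */
    eax -= edx;                                 /* sub eax, edx: subtracting -1 adds 1 */
    return eax;
}

void check_abs(void)
{
    assert(abs_via_sign_mask(-9) == 9u);
    assert(abs_via_sign_mask(9) == 9u);
    assert(abs_via_sign_mask(0) == 0u);
    /* INT32_MIN has no positive counterpart; the bit pattern is unchanged. */
    assert(abs_via_sign_mask(INT32_MIN) == 0x80000000u);
}
```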

See http://agner.org/optimize/ to learn about making asm that runs fast on x86. 32bit operand size is the default in 64bit mode, so 16 or 64bit operands require an extra prefix. This means larger code size, which means worse I-cache efficiency (and often more decode bottlenecks on pre-Sandybridge CPUs; SnB's uop cache usually means decode isn't a problem.)
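
To make the prefix cost concrete, the sketch below lists the 64-bit-mode encodings of the three instructions as C byte arrays. The array names and the `prefix_count` helper are hypothetical; the bytes themselves are the standard encodings (shared opcode 0x99, operand-size prefix 0x66 for cwd, REX.W prefix 0x48 for cqo).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Machine-code bytes in 64-bit mode. */
const uint8_t CWD_BYTES[] = { 0x66, 0x99 };  /* operand-size prefix + opcode */
const uint8_t CDQ_BYTES[] = { 0x99 };        /* default 32-bit operand size  */
const uint8_t CQO_BYTES[] = { 0x48, 0x99 };  /* REX.W prefix + opcode        */

/* Number of prefix bytes that precede the shared 0x99 opcode. */
size_t prefix_count(const uint8_t *code, size_t len)
{
    assert(len >= 1 && code[len - 1] == 0x99);  /* the opcode is the last byte */
    return len - 1;
}
```

With these arrays, `prefix_count(CDQ_BYTES, sizeof CDQ_BYTES)` is 0, while cwd and cqo each carry one prefix byte.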

16bit operand size also has a false dependency on the previous contents of the register, since writing ax doesn't clear the rest of rax. Fortunately, AMD64 was designed with out-of-order CPUs in mind, so it avoided repeating that design choice, which is inconvenient for high-performance implementations: writing the low 32bits of a GP reg clears the upper32. (x86 CPUs already used out-of-order execution when AMD64 was designed, unlike when ax was extended to eax.)
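
The difference between the two write behaviours can be modelled in a few lines of C. This is a sketch of the architectural result only, not of how any CPU implements it, and `write_ax`, `write_eax` and `check_partial_writes` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Writing ax merges into the old rax, so the result depends on old_rax. */
uint64_t write_ax(uint64_t old_rax, uint16_t value)
{
    return (old_rax & ~(uint64_t)0xFFFF) | value;
}

/* Writing eax zero-extends into rax, so the old contents do not matter. */
uint64_t write_eax(uint64_t old_rax, uint32_t value)
{
    (void)old_rax;  /* deliberately unused: no dependency on the old value */
    return (uint64_t)value;
}

void check_partial_writes(void)
{
    uint64_t old = 0x1122334455667788ULL;
    assert(write_ax(old, 0xFFF7u) == 0x112233445566FFF7ULL);
    assert(write_eax(old, 0xFFFFFFF7u) == 0x00000000FFFFFFF7ULL);
}
```

The unused `old_rax` parameter in `write_eax` is the point: a 32-bit write needs nothing from the previous register value, whereas a 16-bit write has to read it.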

[Why do most x64 instructions zero the upper part of a 32 bit register](http://stackoverflow.com/q/11177137/995714) — phuclv, Nov 20 '15 at 03:18
@LưuVĩnhPhúc: thanks, I was too lazy to look up and link that question, which I knew existed. — Peter Cordes, Nov 20 '15 at 03:33