I'm using x64dbg to test SSE/AVX instructions to better understand their behavior before using them in my own code. I've been able to test the vmovapd, vbroadcastsd, vsubpd, and vaddpd instructions this way without issue.
I loaded YMM registers as follows:
YMM0: 0000000000000004000000000000000400000000000000040000000000000004
YMM1: 0000000000000002000000000000000200000000000000020000000000000002
YMM2: 0101010101010101010101010101010101010101010101010101010101010101
Then, I execute this instruction:
VMULPD ymm2, ymm1, ymm0
I'm trying to multiply YMM0 by YMM1 and store the result in YMM2. However, after I execute this instruction, YMM2 contains the following:
0000000000000000000000000000000000000000000000000000000000000000
But I would expect this:
0000000000000008000000000000000800000000000000080000000000000008
(That's four 8's from 4.0 * 2.0)
According to page 798 of the Intel 64 and IA-32 Architectures Software Developer's Manual, this should work:
VMULPD ymm1, ymm2, ymm3/m256
Multiply packed double-precision floating-point values
in ymm3/m256 with ymm2 and store result in ymm1.
So what am I missing here?
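One check that may help narrow this down: the manual text quoted above says VMULPD works on "packed double-precision floating-point values", so each 64-bit lane is read as an IEEE-754 double rather than as an integer. The Python sketch below decodes one lane of YMM0 and YMM1 exactly as loaded above, multiplies the two, and prints the bit patterns that 4.0, 2.0, and 8.0 would have. The helper names `bits_to_double` and `double_to_bits` are made up for illustration, and the sketch assumes Python floats are IEEE-754 doubles, which holds on mainstream platforms.

```python
import struct


def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit pattern as an IEEE-754 double (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def double_to_bits(value: float) -> int:
    """Reinterpret an IEEE-754 double as its 64-bit pattern (hypothetical helper)."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


# One 64-bit lane from each register, exactly as loaded in the debugger.
lane_ymm0 = 0x0000000000000004
lane_ymm1 = 0x0000000000000002

a = bits_to_double(lane_ymm0)
b = bits_to_double(lane_ymm1)
product = a * b  # same per-lane operation VMULPD performs on doubles

print(f"lane of YMM0 as a double: {a!r}")
print(f"lane of YMM1 as a double: {b!r}")
print(f"product: {product!r}, bits: {double_to_bits(product):016X}")

# Bit patterns of the values the test was meant to use.
for value in (4.0, 2.0, 8.0):
    print(f"{value} -> {double_to_bits(value):016X}")
```

If the lanes decode to tiny subnormal values rather than 4.0 and 2.0, a product that underflows to zero would be consistent with the all-zero YMM2 shown in the debugger. In that case, reloading the registers with the bit patterns printed by the final loop would be the next thing to try.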