
I'm using x64dbg to test SSE/AVX assembly instructions to better understand their behavior before writing code with them. I've been able to test the vmovapd, vbroadcastsd, vsubpd, and vaddpd instructions this way without issue.

I loaded YMM registers as follows:

YMM0: 0000000000000004000000000000000400000000000000040000000000000004
YMM1: 0000000000000002000000000000000200000000000000020000000000000002
YMM2: 0101010101010101010101010101010101010101010101010101010101010101

Then, I execute this instruction:

VMULPD ymm2, ymm1, ymm0

I'm trying to multiply YMM0 by YMM1 and store the result in YMM2. However, after I execute this instruction, YMM2 contains the following:

0000000000000000000000000000000000000000000000000000000000000000

But I would expect this:

0000000000000008000000000000000800000000000000080000000000000008
(That's four 8's from 4.0 * 2.0)

According to the Intel 64 and IA-32 Software Developer's Manual on page 798, this should work:

VMULPD ymm1, ymm2, ymm3/m256

Multiply packed double-precision floating-point values 
in ymm3/m256 with ymm2 and store result in ymm1.

So what am I missing here?

Gogeta70
    You loaded integers not double precision floats. Duh. Well, technically those of course also represent some float values, presumably some very small numbers (can't be bothered to check) so the result will be zero. – Jester Nov 15 '17 at 00:15
    @Jester Oh man. Today is not my day. Lack of sleep due to baby is really messing with me! lol. Thanks for pointing out my obvious mistake once again. – Gogeta70 Nov 15 '17 at 00:17
    You might be interested in `VPMULLD` though :) – Jester Nov 15 '17 at 00:18
    @Jester: Yes, they're tiny denormals whose products underflow to zero. IEEE binary float representations use a biased exponent so you can compare them as integers (except for the sign bit). This is one of the reasons for the exponent bias. – Peter Cordes Nov 15 '17 at 01:06

1 Answer


The values you loaded represent denormal (very small) doubles. Their products underflow to +0.0, even without FTZ / DAZ (flush to zero / denormals-are-zero) enabled.
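You can check this outside the debugger by reinterpreting one 64-bit lane of each register as a double (a quick Python sketch using `struct` to mimic the bit-level reinterpretation; the variable names are just for illustration):

```python
import struct

# Reinterpret one 64-bit lane of each register as an IEEE-754 binary64 double.
# 0x0000000000000004 and 0x0000000000000002 are tiny subnormals.
a = struct.unpack('<d', struct.pack('<q', 4))[0]  # 4 * 2**-1074, about 2e-323
b = struct.unpack('<d', struct.pack('<q', 2))[0]  # 2 * 2**-1074, about 1e-323

# The true product (~4e-646) is far below the smallest subnormal (2**-1074),
# so it underflows to +0.0 -- exactly what VMULPD produced.
print(a * b)
```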

And BTW, writing 0 into ymm2 isn't "doing nothing": VMULPD's destination operand is write-only, so seeing it change from the 0x0101... debug-sentinel you loaded to all-zero proves the instruction did execute.


If you were looking for a 64-bit packed-integer multiply, then tough: AVX2 has no packed 64-bit multiply. It does have a 32x32 -> 64-bit multiply (vpmuludq) and a packed 32x32 -> 32-bit multiply (VPMULLD, 2 uops on many CPUs). It may still be profitable to vectorize 64x64 -> 64-bit multiply using AVX2; see Fastest way to multiply an array of int64_t?.

AVX512 has 64x64 -> 64 bit multiply, though.

If your integers (and their product) can be exactly represented by double, it might be worth it to use packed-conversion to double and use vmulpd on that, because the hardware has excellent throughput for add / mul / FMA with packed double. (Haswell/Skylake: 2 vectors per clock, much better than vpmulld)
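The "exactly representable" condition is easy to sanity-check in plain Python, since it's a property of the binary64 format itself rather than of any particular instruction (the threshold value here is just for illustration):

```python
# A double has a 53-bit significand, so integers (and their products)
# up to 2**53 round-trip through double exactly; beyond that, precision is lost.
exact_limit = 2**53

a = 94906265                          # just below sqrt(2**53)
assert a * a < exact_limit
assert float(a) * float(a) == a * a   # the product survives conversion exactly

# 2**53 + 1 has no double representation; it rounds back down to 2**53.
assert float(exact_limit + 1) == float(exact_limit)
```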


Fun fact:

IEEE float/double has the interesting property that (other than the sign bit) comparing as integers "works". This is why the exponent is biased, so that integer adding 1 to the binary representation produces the next representable value. (Implementing nextafterf is fun on SSE; I looked at it a while ago but never got around to sending a patch.)
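The bit-twiddling side of that property can be demonstrated in a few lines of Python (`bits` and `from_bits` are hypothetical helper names for the reinterpretation, not standard-library functions):

```python
import struct

def bits(x: float) -> int:
    """Raw IEEE-754 binary64 bit pattern of x, as an unsigned integer."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def from_bits(n: int) -> float:
    """Double whose binary64 bit pattern is n."""
    return struct.unpack('<d', struct.pack('<Q', n))[0]

# Incrementing the integer representation yields the next representable double:
# 1.0 is 0x3FF0000000000000, and +1 in the significand gives 1.0 + 2**-52.
assert from_bits(bits(1.0) + 1) == 1.0 + 2**-52

# For non-negative doubles, integer comparison of the bit patterns agrees
# with floating-point comparison -- the biased exponent makes this work.
assert (bits(1.5) < bits(2.0)) == (1.5 < 2.0)
```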

See also https://www.h-schmidt.net/FloatConverter/IEEE754.html for int -> float binary representation. (for single-precision, but https://en.wikipedia.org/wiki/Double-precision_floating-point_format IEEE binary64 works the same as binary32.)

Peter Cordes
    Unfortunately, I wasn't thinking and was using integral values when I **intended** to multiply float values. Excellent answer! – Gogeta70 Nov 15 '17 at 04:30