I'm really confused by the _mm256_cvtps_ph and _mm256_cvtxps_ph intrinsics. Intel says:
__m128h _mm256_cvtxps_ph (__m256 a) Convert packed single-precision (32-bit) floating-point elements in a to packed half-precision (16-bit) floating-point elements, and store the results in dst.
__m128i _mm256_cvtps_ph (__m256 a, int imm8) Convert packed single-precision (32-bit) floating-point elements in a to packed half-precision (16-bit) floating-point elements, and store the results in dst. Rounding is done according to the imm8[2:0] parameter.
So both do the same thing, except that _mm256_cvtps_ph lets me set the rounding mode? Testing suggests otherwise:
#include <immintrin.h>
#include <stdio.h>

union U256f {
    __m256 v;
    float a[8];
};

void print256_f32(const __m256 v)
{
    const U256f u = { v };
    for (int i = 0; i < 8; ++i)
    {
        printf("%f\n", u.a[i]);
    }
}

void print128_f16(const __m128h v)
{
    // printf doesn't support fp16, so widen the 8 halves to fp32 first
    __m256 fp32 = _mm256_cvtph_ps(_mm_castph_si128(v));
    print256_f32(fp32);
}
int main()
{
    __m256 constant_two = _mm256_set1_ps(2.0f);
    printf("Input YMM register:\n");
    print256_f32(constant_two);
    __m128h cvtxps = _mm256_cvtxps_ph(constant_two);
    __m128h cvtps = _mm256_cvtps_ph(constant_two, _MM_FROUND_TO_NEAREST_INT);
    printf("_mm256_cvtxps_ph:\n");
    print128_f16(cvtxps);
    printf("_mm256_cvtps_ph:\n");
    print128_f16(cvtps);
}
Prints:
Input YMM register:
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
_mm256_cvtxps_ph:
-0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
_mm256_cvtps_ph:
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
Since _mm256_cvtxps_ph gives me garbage, how is it supposed to be used?
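For reference, this is the per-lane result I would expect from either intrinsic. The following is a minimal scalar sketch of an fp32 to fp16 conversion with round-to-nearest-even. The names float_to_half_bits and half_bits_to_float are hypothetical helpers, not Intel intrinsics, and the sketch does not claim to match the hardware in every corner case (NaN payloads in particular are collapsed to a single quiet NaN).

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: float -> IEEE 754 binary16 bit pattern,
// rounding to nearest, ties to even.
std::uint16_t float_to_half_bits(float f)
{
    std::uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    const std::uint32_t sign = (x >> 16) & 0x8000u;
    const std::uint32_t mag = x & 0x7FFFFFFFu;

    if (mag > 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7E00u);  // NaN -> quiet NaN
    if (mag == 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7C00u); // infinity

    const int exp = static_cast<int>(mag >> 23) - 127 + 15; // re-biased exponent
    std::uint32_t mant = mag & 0x007FFFFFu;

    if (exp >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u); // too large -> infinity
    if (exp <= 0) {
        // Result is a half-precision subnormal or zero.
        if (exp < -10) return static_cast<std::uint16_t>(sign);       // below half the smallest subnormal
        mant |= 0x00800000u;                                          // make the hidden bit explicit
        const int shift = 14 - exp;                                   // 14..24
        std::uint32_t half = mant >> shift;
        const std::uint32_t rem = mant & ((1u << shift) - 1u);
        const std::uint32_t halfway = 1u << (shift - 1);
        if (rem > halfway || (rem == halfway && (half & 1u))) ++half;
        return static_cast<std::uint16_t>(sign | half);
    }

    // Normal range: keep the top 10 mantissa bits, round on the 13 dropped bits.
    // A mantissa carry spills into the exponent, which also handles overflow to infinity.
    std::uint32_t half = (static_cast<std::uint32_t>(exp) << 10) | (mant >> 13);
    const std::uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) ++half;
    return static_cast<std::uint16_t>(sign | half);
}

// Hypothetical helper: binary16 bit pattern -> float (exact, no rounding needed).
float half_bits_to_float(std::uint16_t h)
{
    const bool negative = (h & 0x8000u) != 0;
    const int exp = (h >> 10) & 0x1F;
    const std::uint32_t mant = h & 0x03FFu;
    float magnitude;
    if (exp == 0) {
        magnitude = std::ldexp(static_cast<float>(mant), -24);        // subnormal or zero
    } else if (exp == 31) {
        magnitude = mant ? std::numeric_limits<float>::quiet_NaN()
                         : std::numeric_limits<float>::infinity();
    } else {
        magnitude = std::ldexp(static_cast<float>(mant | 0x0400u), exp - 25);
    }
    return negative ? -magnitude : magnitude;
}
```

With these helpers, float_to_half_bits(2.0f) is 0x4000 and half_bits_to_float(0x4000) is 2.0f, so each of the 8 output lanes should hold the bit pattern 0x4000. That matches what _mm256_cvtps_ph prints above, and not what _mm256_cvtxps_ph prints.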
Edit:
The code above was compiled with clang, but VS2022 and the Intel C compiler produce the same output. Here is the disassembly from clang:
__m128h cvtxps = _mm256_cvtxps_ph(constant_two);
00007FF631F41DF7 vcvtps2phx xmm0,ymmword ptr [rbp+40h]
00007FF631F41DFE movdqa xmmword ptr [rbp+1C0h],xmm0
00007FF631F41E06 movdqa xmm0,xmmword ptr [rbp+1C0h]
00007FF631F41E0E movdqa xmmword ptr [cvtxps],xmm0
__m128h cvtps = _mm256_cvtps_ph(constant_two, _MM_FROUND_TO_NEAREST_INT);
00007FF631F41E13 vmovups ymm0,ymmword ptr [constant_two]
00007FF631F41E18 vcvtps2ph xmm0,ymm0,0
00007FF631F41E1E movdqa xmmword ptr [rbp+1F0h],xmm0
00007FF631F41E26 movdqa xmm0,xmmword ptr [rbp+1F0h]
00007FF631F41E2E movdqa xmmword ptr [cvtps],xmm0
Looking at the intrinsics guide, these are the expected instructions.
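The intrinsics guide lists _mm256_cvtxps_ph under AVX512-FP16 (with AVX512VL), whereas _mm256_cvtps_ph only needs F16C, so one thing worth ruling out is whether the CPU reports that feature at all. Below is a minimal sketch, assuming GCC or clang on x86 with <cpuid.h>. The function names are hypothetical, and the check covers only the CPUID feature bits, not whether the OS has enabled the AVX-512 register state.

```cpp
#include <cstdint>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Hypothetical helper: decode CPUID leaf 7, subleaf 0.
// Bit positions: EBX[16] = AVX512F, EBX[31] = AVX512VL, EDX[23] = AVX512_FP16.
bool leaf7_has_avx512_fp16_vl(std::uint32_t ebx, std::uint32_t edx)
{
    const bool avx512f  = ((ebx >> 16) & 1u) != 0;
    const bool avx512vl = ((ebx >> 31) & 1u) != 0;
    const bool fp16     = ((edx >> 23) & 1u) != 0;
    return avx512f && avx512vl && fp16;
}

// Hypothetical helper: query the running CPU. Returns false on non-x86 targets
// or when leaf 7 is not available.
bool cpu_reports_avx512_fp16()
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return leaf7_has_avx512_fp16_vl(ebx, edx);
#else
    return false;
#endif
}
```

Calling cpu_reports_avx512_fp16() at the top of main and printing the result would show whether vcvtps2phx is expected to be available on the machine running the test.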