I'm really confused by the _mm256_cvtps_ph and _mm256_cvtxps_ph intrinsics. Intel says:

__m128h _mm256_cvtxps_ph (__m256 a) Convert packed single-precision (32-bit) floating-point elements in a to packed half-precision (16-bit) floating-point elements, and store the results in dst.

__m128i _mm256_cvtps_ph (__m256 a, int imm8) Convert packed single-precision (32-bit) floating-point elements in a to packed half-precision (16-bit) floating-point elements, and store the results in dst. Rounding is done according to the imm8[2:0] parameter

So both do the same thing, except I can set the rounding mode in _mm256_cvtps_ph ? Testing it, this does not seem to be the case:

union U256f {
    __m256 v;
    float a[8];
};

void print256_f16(const __m256h v)
{
    //printf doesn't support fp16, so convert to fp32
    __m512 fp32 = _mm512_cvtph_ps((__m256i)v);

    const U512f u = { fp32 };
    for (int i = 0; i < 16; ++i)
    {
        printf("%f\n", u.a[i]);
    }
}

int main()
{
    __m256 constant_two = _mm256_set1_ps(2.0);

    printf("Input YMM register:\n");
    print256_f32(constant_two);

    __m128h cvtxps = _mm256_cvtxps_ph(constant_two);
    __m128h cvtps = _mm256_cvtps_ph(constant_two, _MM_FROUND_TO_NEAREST_INT);

    printf("_mm256_cvtxps_ph:\n");
    print128_f16(cvtxps);

    printf("_mm256_cvtps_ph:\n");
    print128_f16(cvtps);
}

Prints:

Input YMM register:
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
_mm256_cvtxps_ph:
-0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
0.000000
_mm256_cvtps_ph:
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000
2.000000

Since I'm getting garbage out for _mm256_cvtxps_ph, how is it supposed to be used?

Edit:

The code above was compiled in clang, but VS2012 and Intel C compiler produce the same output. Here is the disassembly from clang:

    __m128h cvtxps = _mm256_cvtxps_ph(constant_two);
00007FF631F41DF7  vcvtps2phx  xmm0,ymmword ptr [rbp+40h]  
00007FF631F41DFE  movdqa      xmmword ptr [rbp+1C0h],xmm0  
00007FF631F41E06  movdqa      xmm0,xmmword ptr [rbp+1C0h]  
00007FF631F41E0E  movdqa      xmmword ptr [cvtxps],xmm0  
    __m128h cvtps = _mm256_cvtps_ph(constant_two, _MM_FROUND_TO_NEAREST_INT);
00007FF631F41E13  vmovups     ymm0,ymmword ptr [constant_two]  
00007FF631F41E18  vcvtps2ph   xmm0,ymm0,0  
00007FF631F41E1E  movdqa      xmmword ptr [rbp+1F0h],xmm0  
00007FF631F41E26  movdqa      xmm0,xmmword ptr [rbp+1F0h]  
00007FF631F41E2E  movdqa      xmmword ptr [cvtps],xmm0 

Looking at the intrinsics guide, these are the expected instructions.

user1850479
  • What gcc version and what hardware do you have? I can't find an option to enable the cvtxps instructions. – Tim Roberts Jul 09 '23 at 04:37
  • `vcvtps2phx xmm0,ymmword ptr [rbp+40h]` - did clang earlier copy `[constant_two]` to the stack? It didn't do that with `vcvtps2ph`. Also, how did you get clang to emit `vmovups` for some instructions but legacy-SSE encodings of `movdqa` in the same function? If AVX is enabled, GCC and clang use VEX encoding for every instruction that supports it. (Doesn't look like a correctness problem, though.) – Peter Cordes Jul 09 '23 at 19:05
  • Or is `[constant_two]` actually a local variable that your disassembler is inventing a symbol name for? Like with `[cvtps]` which also isn't a global in static storage. Anyway it's odd that this debug build (?) isn't loading from the original `[constant_two]` for the `x` version. – Peter Cordes Jul 09 '23 at 20:34

1 Answer

The non-x conversion asm instructions were new in F16C (conversion only), since Ivy Bridge.
The x versions were new in AVX512-FP16 (includes FP16 math), since Sapphire Rapids.

In the intrinsics API, the old F16C intrinsics use __m128i (treating it as an integer vector), but AVX-512FP16 intrinsics introduce a new vector type, __m128h / __m256h / __m512h.
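Either way, each lane is just a 16-bit IEEE binary16 pattern, so one way to sanity-check a result without relying on another conversion instruction is to copy the vector into a uint16_t array and decode the bits in plain C. This is only a sketch: half_bits_to_float is a hypothetical helper name, not part of any intrinsics header.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any intrinsics header): decode one
   IEEE 754 binary16 bit pattern into a float using integer math only. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;  /* move sign to bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;              /* 5-bit exponent */
    uint32_t man  = h & 0x3FFu;                     /* 10-bit mantissa */
    uint32_t bits;
    float f;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent in both formats. */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal half: shift until the implicit bit appears. */
        int shift = 0;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            ++shift;
        }
        man &= 0x3FFu;
        bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
    }

    memcpy(&f, &bits, sizeof f);  /* type-pun without aliasing problems */
    return f;
}
```

With a helper like this you could memcpy a __m128i or __m128h result into a uint16_t[8] and print both the raw hex and the decoded value for each lane. A correct conversion of 2.0f should show 0x4000 in every lane.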


According to the asm manual (https://www.felixcloutier.com/x86/vcvtph2ps:vcvtph2psx and https://www.felixcloutier.com/x86/vcvtps2ph vs. https://www.felixcloutier.com/x86/vcvtps2phx), the new VCVTPH2PSX (AVX512-FP16) can use a 16-bit broadcasted memory operand as a source, unlike the non-x version.

The x version of PS to PH supports a rounding override in the EVEX prefix (for 512-bit source only, thus 256-bit destination), as well as a 32-bit broadcast source. The VEX encoding of the non-x version (VCVTPS2PH) is the same code size as EVEX VCVTPS2PHX, since it needs a 3-byte VEX prefix plus an immediate vs. a 4-byte EVEX prefix.
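To illustrate what the rounding control actually changes, here is a rough software model of FP32 to FP16 conversion in plain C. It is a sketch, not how the hardware is implemented: float_to_half_bits and the half_rounding enum are hypothetical names, and only two modes are modelled, corresponding to _MM_FROUND_TO_NEAREST_INT and _MM_FROUND_TO_ZERO.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names for illustration; only two rounding modes are modelled. */
enum half_rounding { HALF_ROUND_NEAREST_EVEN, HALF_ROUND_TOWARD_ZERO };

/* Software model of one float -> binary16 conversion. */
static uint16_t float_to_half_bits(float f, enum half_rounding mode)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);

    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp  = (x >> 23) & 0xFFu;
    uint32_t man  = x & 0x7FFFFFu;

    if (exp == 0xFFu)   /* Inf stays Inf; NaN becomes a quiet NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? (0x200u | (man >> 13)) : 0u));

    int e = (int)exp - 127 + 15;   /* exponent rebiased for binary16 */

    if (e >= 31)        /* too large: Inf, or max finite when truncating */
        return (uint16_t)(sign | (mode == HALF_ROUND_NEAREST_EVEN ? 0x7C00u
                                                                  : 0x7BFFu));
    if (e < -10)        /* below half of the smallest subnormal: zero */
        return (uint16_t)sign;

    man |= 0x800000u;   /* make the implicit leading 1 explicit */

    /* Normals drop 13 bits; subnormal results drop more as e decreases. */
    uint32_t shift   = (e > 0) ? 13u : (uint32_t)(14 - e);
    uint32_t rem     = man & ((1u << shift) - 1u);   /* discarded bits */
    uint32_t halfway = 1u << (shift - 1);

    /* For normals the implicit bit lands on bit 10 and bumps (e - 1) to e. */
    uint32_t result = ((e > 0) ? ((uint32_t)(e - 1) << 10) : 0u) + (man >> shift);

    if (mode == HALF_ROUND_NEAREST_EVEN &&
        (rem > halfway || (rem == halfway && (result & 1u))))
        ++result;       /* a carry into the exponent is the correct result */

    return (uint16_t)(sign | result);
}
```

For a value like 2.0f that is exactly representable, both modes give 0x4000, so the rounding mode can't explain a difference between the two intrinsics in the question. The modes only differ for inputs that fall between two half-precision values, for example 1.0f + 3.0f/2048.0f.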

The only reason to have a different intrinsic for the x version of PH to PS is that some compilers let you use intrinsics without enabling the instruction sets (notoriously MSVC, which only provides /arch:AVX512 without any way to specify which AVX-512 extensions are available). There's no intrinsic that takes a pointer, so it's up to the compiler to fold a broadcast load like _mm_set1_ph(_Float16) or _mm_set1_epi16(short) into a memory source operand instead of emitting a separate vpbroadcastw or something. But MSVC wouldn't know it was allowed to do that if there were only the old intrinsic, which doesn't imply anything beyond AVX-512F or just AVX1+F16C.

MSVC also apparently doesn't support _mm_set1_ph yet, as discussed in the comments on "AVX512-FP16 intrinsics fails in release mode, works in debug".

MSVC has at least some bugs in its AVX512-FP16 support; again, see that same question.

Your code works fine when compiled with GCC (on Godbolt), after fixing it to use 256-bit vectors like your main expects, and stuff like that, and commenting out the call to the missing print256_f32. After GCC's done optimizing, it only uses vcvtps2ph, not the x version, so it runs on older CPUs including the AWS instances Godbolt runs on, despite compiling with -march=sapphirerapids.

This is a missed optimization: I think it could assume the default MXCSR rounding mode and use VCVTPS2PHX xmm, DWORD PTR .LC3[rip] instead of a separate vbroadcastss. (Even with -ffast-math it misses that optimization.)

But anyway, it's probably just MSVC being buggy again if you're using MSVC. It obviously has FP16 bugs, so use a better compiler like GCC or Clang.

Peter Cordes
  • Thanks for clarifying, I missed that one was F16C, so that makes a bit more sense. After your help with the FP16 issue, I switched to clang, which I used for the above code (although VS gives the same result). I wonder if there is some errata with vcvtps2phx on Alder Lake? It officially had AVX512 support disabled just before launch, so I suppose not many people have used the more obscure instructions. – user1850479 Jul 09 '23 at 18:41
  • @user1850479: Can you provide an actual [mcve] of a full program that gives unexpected results with clang? I may have fixed something without realizing it when adapting the code in your question to actually compile (the Godbolt link in my answer). As I said, the way I compiled resulted in a program with no run-time use of `VCVTPS2PHX`, just constant-propagation. With a MCVE, we can see if the same machine code works differently in SDE on a simulated sapphire rapids; only then do we have to consider Alder Lake's unofficial AVX-512 support being not fully working. – Peter Cordes Jul 09 '23 at 19:01