
How can I clear the upper 128 bits of m2:

__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);

Neither works -- Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”. At the same time, I can easily do it in assembly:

VMOVDQA xmm2, xmm2  // zeros upper ymm2
VMOVDQA xmm2, xmm1  // copies xmm1 and zeros upper ymm2

Of course, I'd rather not use an AND mask or _mm256_insertf128_si256() and the like.

seda
  • What is wrong with using inline assembly? You are already processor specific if you are working with AVX intrinsics. – Sergey L. Jan 27 '14 at 16:16
  • Sergey: no inline assembly in 64-bit VC. Besides that, the C compiler often creates faster code than I would -- it can use a smart instruction order and other tricks. – seda Jan 27 '14 at 18:06
  • `_mm256_zeroupper`. Ok, it will do a bit more than you want ;-) – Marc Glisse Feb 06 '14 at 21:00
  • With gcc, `__m256i y={x[0],x[1],0,0};` generates a single `vmovdqa`. – Marc Glisse Feb 08 '14 at 11:40
  • @SergeyL.: A lot of things are wrong with inline assembly in the middle of something you want the compiler to optimize. https://gcc.gnu.org/wiki/DontUseInlineAsm points out that it defeats constant propagation, among other things. – Peter Cordes Jul 09 '17 at 10:51

3 Answers


Update: there's now a __m256i _mm256_zextsi128_si256(__m128i) intrinsic; see Agner Fog's answer. The rest of the answer below is only relevant for old compilers that don't support this intrinsic, and where there's no efficient, portable solution.


Unfortunately, the ideal solution will depend on which compiler you are using, and on some of them, there is no ideal solution.

There are several basic ways that we could write this:

Version A:

ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

Version B:

ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                         ymm,
                         _MM_SHUFFLE(0, 0, 3, 3));

Version C:

ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                              _mm256_castsi256_si128(ymm),
                              0);

Each of these does precisely what we want: it clears the upper 128 bits of a 256-bit YMM register, so any of them could safely be used. But which is optimal? Well, that depends on which compiler you are using...

GCC:

Version A: Not supported at all because GCC lacks the _mm256_set_m128i intrinsic. (Could be simulated, of course, but that would be done using one of the forms in "B" or "C".)
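
If you did need to simulate it on an old GCC, a minimal sketch could look like this (my_mm256_set_m128i is a made-up name, not a real intrinsic); it just expands to the same insert form as Version C, so it offers no advantage here:

#include <immintrin.h>

// Hypothetical stand-in for the missing _mm256_set_m128i(hi, lo) on old GCC.
static inline __m256i my_mm256_set_m128i(__m128i hi, __m128i lo)
{
    // Widen lo (upper bits undefined), then overwrite the upper lane with hi.
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}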

Version B: Compiled to inefficient code. The idiom is not recognized, and the intrinsics are translated very literally into machine-code instructions. A temporary YMM register is zeroed using VPXOR, and then that is blended with the input YMM register using VPBLENDD.

Version C: Ideal. Although the code looks kind of scary and inefficient, all versions of GCC that support AVX2 code generation recognize this idiom. You get the expected VMOVDQA xmm?, xmm? instruction, which implicitly clears the upper bits.

Prefer Version C!

Clang:

Version A: Compiled to inefficient code. A temporary YMM register is zeroed using VPXOR, and then that is combined with the input YMM register using VINSERTI128 (or the floating-point forms, depending on version and options).

Version B & C: Also compiled to inefficient code. A temporary YMM register is again zeroed, but here, it is blended with the input YMM register using VPBLENDD.

Nothing ideal!

ICC:

Version A: Ideal. Produces the expected VMOVDQA xmm?, xmm? instruction.

Version B: Compiled to inefficient code. Zeros a temporary YMM register, and then blends zeros with the input YMM register (VPBLENDD).

Version C: Also compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 to insert zeros into the temporary YMM register.

Prefer Version A!

MSVC:

Version A and C: Compiled to inefficient code. Zeroes a temporary YMM register, and then uses VINSERTI128 (A) or VINSERTF128 (C) to insert zeros into the temporary YMM register.

Version B: Also compiled to inefficient code. Zeros a temporary YMM register, and then blends this with the input YMM register using VPBLENDD.

Nothing ideal!


In conclusion, then, it is possible to get GCC and ICC to emit the ideal VMOVDQA instruction, if you use the right code sequence. But, I can't see any way to get either Clang or MSVC to safely emit a VMOVDQA instruction. These compilers are missing the optimization opportunity.

So, on Clang and MSVC, we have the choice between XOR+blend and XOR+insert. Which is better? We turn to Agner Fog's instruction tables (spreadsheet version also available):

On AMD's Ryzen architecture (Bulldozer-family is similar for the AVX __m256 equivalents of these, and for AVX2 on Excavator):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |    0    |          0.25         |   0 (renamed)
   VPBLENDD     |  2  |    1    |          0.67         |   3
   VINSERTI128  |  2  |    1    |          0.67         |   3

Agner Fog seems to have missed some AVX2 instructions in the Ryzen section of his tables. See this AIDA64 InstLatX64 result for confirmation that VPBLENDD ymm performs the same as VPBLENDW ymm on Ryzen, rather than being the same as VBLENDPS ymm (1c throughput from 2 uops that can run on 2 ports).

See also an Excavator / Carrizo InstLatX64 showing that VPBLENDD and VINSERTI128 have equal performance there (2 cycle latency, 1 per cycle throughput). Same for VBLENDPS/VINSERTF128.

On Intel architectures (Haswell, Broadwell, and Skylake):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |   0-1   |          0.33         |   3 (may be renamed)
   VPBLENDD     |  1  |    1    |          0.33         |   3
   VINSERTI128  |  1  |    3    |          1.00         |   1

Obviously, VMOVDQA is optimal on both AMD and Intel, but we already knew that, and it doesn't seem to be an option on either Clang or MSVC until their code generators are improved to recognize one of the above idioms or an additional intrinsic is added for this precise purpose.

Luckily, VPBLENDD is at least as good as VINSERTI128 on both AMD and Intel CPUs. On Intel processors, VPBLENDD is a significant improvement over VINSERTI128. (In fact, it's nearly as good as VMOVDQA in the rare case where the latter cannot be renamed, except for needing an all-zero vector constant.) Prefer the sequence of intrinsics that results in a VPBLENDD instruction if you can't coax your compiler to use VMOVDQA.

If you need a floating-point __m256 or __m256d version of this, the choice is more difficult. On Ryzen, VBLENDPS has 1c throughput, but VINSERTF128 has 0.67c. On all other CPUs (including AMD Bulldozer-family), VBLENDPS is equal or better, and it's much better on Intel (same as for the integer case). If you're optimizing specifically for AMD, you may need to run more tests to see which variant is fastest in your particular sequence of code; otherwise, use the blend, which is only a tiny bit worse on Ryzen.
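
For reference, a sketch of the corresponding floating-point forms (the helper names here are invented, not intrinsics):

#include <immintrin.h>

// Blend form: preferred on Intel and on most AMD CPUs.
static inline __m256 zero_upper_blend(__m256 ymm)
{
    // Immediate 0x0F keeps the low four floats of ymm, takes zeros for the high four.
    return _mm256_blend_ps(_mm256_setzero_ps(), ymm, 0x0F);
}

// Insert form: slightly better throughput on Ryzen, per the numbers above.
static inline __m256 zero_upper_insert(__m256 ymm)
{
    return _mm256_insertf128_ps(_mm256_setzero_ps(),
                                _mm256_castps256_ps128(ymm), 0);
}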

In summary, then, targeting generic x86 and supporting as many different compilers as possible, we can do:

#if (defined _MSC_VER)

    ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                             ymm,
                             _MM_SHUFFLE(0, 0, 3, 3));

#elif (defined __INTEL_COMPILER)

    ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

#elif (defined __GNUC__)

    // Intended to cover GCC and Clang.
    ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                                  _mm256_castsi256_si128(ymm),
                                  0);

#else
    #error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif

See this and versions A, B, and C separately on the Godbolt compiler explorer.

Perhaps you could build on this to define your own macro-based intrinsic until something better comes down the pike.
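
For example, a sketch of such a wrapper (zero_upper_si256 is an invented name, not a real intrinsic):

#include <immintrin.h>

// Invented helper: clears the upper 128 bits of ymm, choosing the sequence
// that each compiler currently turns into the best code (per the notes above).
static inline __m256i zero_upper_si256(__m256i ymm)
{
#if defined(_MSC_VER)
    return _mm256_blend_epi32(_mm256_setzero_si256(), ymm,
                              _MM_SHUFFLE(0, 0, 3, 3));
#elif defined(__INTEL_COMPILER)
    return _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
#elif defined(__GNUC__)   /* covers GCC and Clang */
    return _mm256_inserti128_si256(_mm256_setzero_si256(),
                                   _mm256_castsi256_si128(ymm), 0);
#else
    #error "Unsupported compiler: need to figure out the optimal sequence."
#endif
}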

Cody Gray - on strike
  • I also tried inserting a lane of zeros into the upper lane of ymm: `_mm256_inserti128_si256(ymm, _mm_setzero_si128(), 1);`. gcc compiles it to an actual `vinserti128`, and clang turns it into a blend, so nothing new there. ICC compiles it to a `VMOVDQA`. – Peter Cordes Jul 09 '17 at 10:40
  • Related: Intel CPUs never eliminate `vmovdqa same,same` or `mov same,same`. When the registers are different, they almost always succeed unless you have a chain of renames with no ALU stuff between. (e.g. `movdqa xmm0, xmm1` / `movdqa xmm1, xmm0` in a loop). Then some will be handled at rename time, and some will take an execution unit. – Peter Cordes Jul 09 '17 at 10:56
  • If I don't use `/arch:AVX`, MSVC uses a non-AVX `xorps xmm2,xmm2` in version A!!! https://godbolt.org/g/UwSvWh – Peter Cordes Jul 09 '17 at 11:02
  • Thanks for the edits, @Peter! I was a bit surprised to see Fog reporting that blend was so fast on AMD, but I didn't think to verify it elsewhere. I don't have any of these CPUs. I haven't even *seen* Ryzen in the flesh. You say that he's "missed some AVX2 instructions in the Ryzen section", but the instructions are there, it's just the numbers that are incorrect. As for MSVC, I'm not surprised. I don't even think it'd be considered a bug. If you're using AVX intrinsics, you really need to be telling the compiler to target AVX. Mixed-mode binaries just don't work. – Cody Gray - on strike Jul 10 '17 at 07:53
  • What page of http://www.agner.org/optimize/instruction_tables.pdf has the Ryzen numbers for `VPBLENDD` then? Are you sure you didn't just text-search and get all the way to Haswell, which is the first place that string appears in the PDF? (The spreadsheet version has separate tabs for each uarch, so this doesn't happen. But the spreadsheet is missing even more Ryzen entries than the PDF.) – Peter Cordes Jul 10 '17 at 10:21
  • Oh, I see what I did. I looked at the floating point `VBLENDPS/PD` entry and assumed it would be the same. Most of the other integer and floating-point instructions are comparable, but probably a bad assumption anyway. Looking closer, it also caught my eye that Fog has `VINSERTI128` down on Ryzen as having a reciprocal throughput of 0.67, while `VINSERTF128` is 0.5. I'm not sure why FP would be faster. It's probably just measuring error. I've honestly never used the spreadsheet version, since I don't have an app installed that can read that format. Would be easier to copy-paste from! @Peter – Cody Gray - on strike Jul 10 '17 at 10:51
  • I was tempted to just look at the FP version too, but glad I checked something else. And yeah, I think the 0.5 for VINSERTF128 is an error, since it doesn't match the number of ports / uops, or the InstLatX64 real numbers. There are a few other wrong numbers in the tables, too, for other CPUs. LibreOffice works well for reading the spreadsheet version. – Peter Cordes Jul 10 '17 at 10:55
  • It is worth noting that with clang 7 all versions generate optimal assembler. – chtz Jan 26 '19 at 11:31
  • Thank you for maintaining this answer, @Peter. I really appreciate all the work you do to keep Stack Overflow accurate, up-to-date, and informative. – Cody Gray - on strike Jun 14 '19 at 16:48

A new intrinsic function has been added for solving this problem:

m2 = _mm256_zextsi128_si256(m1);

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_zextsi128_si256&expand=6177,6177

This function doesn't produce any code if the upper half is already known to be zero; it just makes sure the upper half is not treated as undefined.
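
For example, in terms of the code from the question (assuming a compiler recent enough to provide the intrinsic):

__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_zextsi128_si256(m1);    // m2 = {1,1,1,1, 0,0,0,0}

// Or, to clear the upper half of an existing __m256i:
__m256i v = _mm256_set1_epi32(2);
v = _mm256_zextsi128_si256(_mm256_castsi256_si128(v));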

A Fog
  • This works for Clang and MS compilers, but not Gcc. – A Fog Jun 21 '19 at 18:43
  • [Fixed on GCC trunk](https://github.com/gcc-mirror/gcc/commit/e6b2dc248df351be58ecaa8bb5af8ec523d2530e#diff-729c6845acec256ecd4475c0ef044264). Guess we'll see them in GCC 10. – Nemo May 03 '20 at 19:38

See what your compiler generates for this:

__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_set_m128i(_mm_setzero_si128(), m1);

or alternatively this:

__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_setzero_si256();
m2 = _mm256_inserti128_si256 (m2, m1, 0);

The version of clang I have here seems to generate the same code for either (vxorps + vinsertf128), but YMMV.

Paul R
  • Paul: all my compilers (ICC 14, VC 17, GCC 4.8.1) use vinserti128. With m2 = _mm256_castsi128_si256(m1) they all use a faster vmovdqa and clear the upper half, but I'm not sure if I can rely on that. – seda Jan 27 '14 at 17:57