Left-shift (of float32 array) with AVX2 and filling up with a zero

Question

I have been using the following "trick" in C code with SSE2 for single precision floats for a while now:

static inline __m128 SSEI_m128shift(__m128 data)
{
    /* psrldq: shift the whole 128-bit register right by 4 bytes (one float) */
    return _mm_castsi128_ps(_mm_srli_si128(_mm_castps_si128(data), 4));
}

For data like [1.0, 2.0, 3.0, 4.0], it results in [2.0, 3.0, 4.0, 0.0], i.e. it does a left shift by one position and fills the vacated last element with a zero. If I remember correctly, the above inline function compiles down to a single instruction (with gcc at least).

I am somehow failing to wrap my head around doing the same with AVX2. How could I achieve this in an efficient manner?

Similar questions: 1, 2, 3

If you're using `gcc`, I recommend using gcc vector extensions instead of architecture-specific intrinsics where possible. In particular, you can use `__builtin_shuffle(data, (fvectype){0}, (ivectype){1, 2, 3, 4})`. Be aware, though, that AVX vectors of more than 128 bits are composed of *lanes*, and lane-crossing instructions (which are unavoidable when extending your example straightforwardly) are a fair bit slower than in-lane operations (~3 times slower), so it may be a good idea to review whether you actually need this. — EOF, May 23 '20 at 13:33
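For eight floats, the `__builtin_shuffle` suggestion above might look like the following sketch. The type names `v8sf`/`v8si` and the function `shift8_vecext` are made up for illustration, and `__builtin_shuffle` is GCC-specific (clang offers `__builtin_shufflevector` instead). Indices 0 to 7 select from the first vector and 8 to 15 from the second, so index 8 pulls in a zero.

```c
#include <string.h>

/* Hypothetical type names: 8 x float and 8 x int, 32 bytes each. */
typedef float v8sf __attribute__((vector_size(32)));
typedef int   v8si __attribute__((vector_size(32)));

/* Hypothetical helper: dst[i] = src[i + 1] for i = 0..6, dst[7] = 0. */
void shift8_vecext(const float *src, float *dst)
{
    v8sf data;
    v8sf zero = {0};
    const v8si idx = {1, 2, 3, 4, 5, 6, 7, 8};  /* 8 = element 0 of `zero` */

    memcpy(&data, src, sizeof data);
    v8sf out = __builtin_shuffle(data, zero, idx);
    memcpy(dst, &out, sizeof out);
}
```

GCC lowers this to whatever the selected target offers, so it should still build without `-mavx2`, just with different instructions.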
@EOF Thanks for the pointer. If I was planning on using the arch-specific intrinsics, do you have any idea about how to do what I want? :) — s-m-e, May 23 '20 at 15:56
Sure. `gcc` compiles the gcc vector extensions to the following assembly: `vmovaps %ymm0, %ymm1; vxorps %xmm0, %xmm0, %xmm0; vperm2f128 $33, %ymm0, %ymm1, %ymm0; vpalignr $4, %ymm1, %ymm0, %ymm0`. You can reverse-engineer that into intel-intrinsics if you like. Alternatively, a sane solution would be gcc vector extensions. — EOF, May 23 '20 at 16:13
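One possible reading of that assembly in Intel intrinsics is sketched below. The function names are hypothetical, and the `target("avx2")` attribute is an assumption that lets GCC or clang build it without `-mavx2`. Instead of a separate zeroed register, this variant uses the zeroing bit of `vperm2f128` (immediate `0x81`), so it is not the exact instruction sequence quoted above.

```c
#include <immintrin.h>

/* Hypothetical helper: [d0..d7] -> [d1..d7, 0]. */
__attribute__((target("avx2")))
static __m256 shift_down_one_avx2(__m256 data)
{
    /* vperm2f128: low lane = high lane of data, high lane = zero (bit 7). */
    __m256 hi_zero = _mm256_permute2f128_ps(data, data, 0x81);

    /* vpalignr, per 128-bit lane: (hi_zero_lane : data_lane) >> 4 bytes.
       Low lane:  [d1, d2, d3, d4]; high lane: [d5, d6, d7, 0]. */
    __m256i r = _mm256_alignr_epi8(_mm256_castps_si256(hi_zero),
                                   _mm256_castps_si256(data), 4);
    return _mm256_castsi256_ps(r);
}

/* Hypothetical array wrapper around the helper above. */
__attribute__((target("avx2")))
void shift8_alignr(const float *src, float *dst)
{
    _mm256_storeu_ps(dst, shift_down_one_avx2(_mm256_loadu_ps(src)));
}
```

Both shuffles run on the shuffle port on Intel CPUs, which is the trade-off discussed in the following comments.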
You're welcome. In case you want to do this right, here's a [godbolt link](https://godbolt.org/z/hcLMQQ) with the implementation. — EOF, May 23 '20 at 16:33
@EOF: another way to do this shuffle (which would be better in a loop where you can load vector constants once outside the loop): `vpermd` to do a lane-crossing shuffle with 32-bit elements, `vpblendd` to blend in a `0.0` element where you want it. — Peter Cordes, May 23 '20 at 18:33
@PeterCordes Well, `gcc` [seems to](https://godbolt.org/z/KbWepy) like the code just as well in a loop. Could you explain how `vpermd/vpblendd` would be preferable? Agner Fog and uops.info show `vpalignr` to be fast, it apparently doesn't count as a lane-crossing instruction. — EOF, May 23 '20 at 18:41
@EOF: `vpermd` costs the same as `vperm2f128` on Intel hardware, maybe somewhat less on Zen 1. `vpblendd` is 1 uop for *any* vector ALU port on Intel, so it avoids a potential shuffle-port bottleneck from `vperm2f128` + `vpalignr`. vpalignr is just an in-lane shuffle, that's *why* we need vperm2f128 to set up for it. — Peter Cordes, May 23 '20 at 18:44
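A minimal sketch of the permute-plus-blend idea, using the float forms `vpermps` and `vblendps`. The function name `shift8_permute_blend` is hypothetical, and the `target("avx2")` attribute is an assumption so that it builds without `-mavx2`. The index vector is the constant that would be hoisted out of a loop.

```c
#include <immintrin.h>

/* Hypothetical helper: dst[i] = src[i + 1] for i = 0..6, dst[7] = 0. */
__attribute__((target("avx2")))
void shift8_permute_blend(const float *src, float *dst)
{
    /* Rotate by one element across both lanes; slot 7 gets a dummy index. */
    const __m256i idx = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 0);

    __m256 data = _mm256_loadu_ps(src);
    __m256 rot  = _mm256_permutevar8x32_ps(data, idx);        /* vpermps  */

    /* Bit 7 of the mask takes element 7 from the zero vector. */
    __m256 out  = _mm256_blend_ps(rot, _mm256_setzero_ps(), 0x80); /* vblendps */

    _mm256_storeu_ps(dst, out);
}
```

Only the permute needs the shuffle port here; the blend can go to other vector ALU ports on Intel, as the comment above explains.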
@PeterCordes Hmm, ok. Though `vpalignr` being in-lane is (to me) not at all obvious, since it moves data from the low lane of at least one input to the high lane of the output. — EOF, May 23 '20 at 18:49
@EOF: No, it doesn't; that's why it's so hard to use / such a bad design for extending `palignr` to 256 bits, and why GCC needed `vperm2f128`. See the 256-bit diagram in https://www.felixcloutier.com/x86/palignr — Peter Cordes, May 23 '20 at 18:52
@PeterCordes Ohhhh, the lanes from sources are effectively *rotated* into the corresponding lane of the destination! That's... not great. Well, at least I now seem to understand the instruction, so thank you for that as well. — EOF, May 23 '20 at 19:00
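The per-lane behaviour can be checked with a small sketch like the one below. The function name `alignr_lanes_demo` is hypothetical, and the `target("avx2")` attribute is an assumption so that it builds without `-mavx2`. With 32-bit elements and a shift of 4 bytes, each output lane takes three elements from the matching lane of `b` and one from the matching lane of `a`. Nothing crosses between the low and high lanes.

```c
#include <immintrin.h>

/* Hypothetical demo: dst = [b1, b2, b3, a0,  b5, b6, b7, a4]. */
__attribute__((target("avx2")))
void alignr_lanes_demo(const float *a, const float *b, float *dst)
{
    __m256i va = _mm256_castps_si256(_mm256_loadu_ps(a));
    __m256i vb = _mm256_castps_si256(_mm256_loadu_ps(b));

    /* vpalignr: for each 128-bit lane, (a_lane : b_lane) >> 4 bytes. */
    __m256i r = _mm256_alignr_epi8(va, vb, 4);

    _mm256_storeu_ps(dst, _mm256_castsi256_ps(r));
}
```

Passing an all-zero `a` would therefore put zeros in elements 3 and 7, not only in element 7, which is why the lane-crossing `vperm2f128` is needed first.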
gcc and clang optimize this to a `vpermd`/`vpermps` and a `vblendps`: https://godbolt.org/z/RW7_ds which requires a shuffle vector (technically, one could use the same vector for the blend as well). It should be possible with just a `vperm2f128` and a `vpalignr` (the `vperm2f128` can set one half to 0) -- requiring no shuffle vector, but two operations on p5. — chtz, May 23 '20 at 19:25
@chtz That's the opposite direction. [This](https://godbolt.org/z/pBNApN) would be the right direction, and it's ok for clang, but gcc doesn't like that at all. — EOF, May 23 '20 at 20:08
@EOF, argh.. you are right -- I still sometimes get confused with left and right (also OP actually appears to require a right shift, despite the title saying "left-shift" ...) — chtz, May 23 '20 at 20:17
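The left/right naming is easy to mix up because the byte shifts are named after register bit order, not array order. A small SSE2 sketch (the function name `byte_shift_directions` is hypothetical) shows both directions on four floats:

```c
#include <emmintrin.h>

/* Hypothetical demo of the two SSE2 byte-shift directions. */
void byte_shift_directions(const float *src,
                           float *toward_index0,
                           float *away_from_index0)
{
    __m128i v = _mm_castps_si128(_mm_loadu_ps(src));

    /* psrldq ("right" shift): [s0, s1, s2, s3] -> [s1, s2, s3, 0] */
    _mm_storeu_ps(toward_index0, _mm_castsi128_ps(_mm_srli_si128(v, 4)));

    /* pslldq ("left" shift):  [s0, s1, s2, s3] -> [0, s0, s1, s2] */
    _mm_storeu_ps(away_from_index0, _mm_castsi128_ps(_mm_slli_si128(v, 4)));
}
```

The question's helper uses the first form, which is a right shift at the register level even though the array contents move left.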
