5

I'm trying to find a more efficient way to "rotate" or shift the 32 bit floating point values within an avx _m256 vector to the right or left by one place.

Such that:

a7, a6, a5, a4, a3, a2, a1, a0

becomes

0, a7, a6, a5, a4, a3, a2, a1

(I dont mind if the data gets lost as I replace the cell anyway.)

I've already taken a look at this thread: Emulating shifts on 32 bytes with AVX but I don't really understand what is going on, and it doesn't explained what the _MM_SHUFFLE(0, 0, 3, 0) does as an input parameter.


I'm trying to optimise this code:

_mm256_store_ps(temp, array[POS(ii, jj)]);
_mm256_store_ps(left, array[POS(ii, jj-1)]);

tmp_array[POS(ii, jj)] = _mm256_set_ps(left[0], temp[7], temp[6], temp[5], temp[4], temp[3], temp[2], temp[1]);

I know once a shift is in place, I can use an insert to replace the remaining cell. I feel this will be more efficient then unpacking into a float[8] array and reconstructing.

-- I'd also like to be able to shift both left and right, as I need to perform a similar operation elsewhere.

Any help is greatly appreciated! Thanks!

Community
  • 1
  • 1
MishMash95
  • 753
  • 7
  • 9
  • 2
    We call that a shuffle, since you're moving elements around, not shifting or rotating bits inside elements. – Peter Cordes Nov 25 '16 at 12:31
  • Thanks for the terminology correction, have renamed question appropriately! – MishMash95 Nov 25 '16 at 12:32
  • 1
    What is the immediately preceding code which populates what will then be shifted, and what's the immediately following code which will use the shifted result? Maybe the outer parts can be integrated into the optimization effort. – John Zwinck Nov 25 '16 at 12:32
  • Do you need a non-AVX2 version of this? For AVX2, just use [`VPERMPS`](http://www.felixcloutier.com/x86/VPERMPS.html) with a shuffe-mask in a register. It can do arbitrary lane-crossing shuffles. – Peter Cordes Nov 25 '16 at 12:35
  • The only proceeding code that comes before is the definition of a few float arrays used by mm256_store_ps. (__attribute__ ((aligned (32))) float temp[8], up[8], down[8], left[8], right[8];) The code after is quite long, however it is a series of mathematical expressions. The code performs a simulation of lattice boltzman fluid dynamics. Originally it was performed on a per-cell basis (with each cell being a float). I optimised this by performing 8 iterations simultaneously with the AVX instructions. This stage was transferring the flow from neighbouring cells. – MishMash95 Nov 25 '16 at 12:36
  • @PeterCordes I believe I have support for AVX2, Could you please elaborate on how to use that function? in my case, if I wanted to shift each element by 1, would the offset mask just be a series of 1's? Thanks – MishMash95 Nov 25 '16 at 12:37
  • John was asking where the input vector comes from. (e.g. is it the result of another computation, or does it typically start out in memory?) – Peter Cordes Nov 25 '16 at 12:37
  • @PeterCordes John Zwinck the input vector is stored in an array of vectors (of type __m256) called array[..]. The output gets stored in another array of vectors called tmp_array[..] The vector temp is the cell we are currently working on in the grid, left is the one to the left of that (jj-1). (My axis are flipped so jj represents the horizontal, or row-major direction.) Though, the data is not contiguous for all cases. Thanks! – MishMash95 Nov 25 '16 at 12:40
  • It's the same as any other variable-shuffle. Each element holds a 0..7 index in this case. Click the link in my previous comment to see Intel's description of how it works. Use `_mm256_set_epi32( 0,7,6,5,4,3,2,1)` to rotate left. See also http://stackoverflow.com/questions/37084379/convert-mm-shuffle-epi32-to-c-expression-for-the-permutation if you're having trouble understanding how shuffle indices work. – Peter Cordes Nov 25 '16 at 12:42
  • @PeterCordes Ah, makes sense! Thanks for the help, will give it a go! – MishMash95 Nov 25 '16 at 12:43
  • Wait a minute: *(I dont mind if the data gets lost as I replace the cell anyway.)*? That's just if you were using it as setup for an insert, right? You should probably take out (or isolate to one section) some of the stuff that assumes the solution is going to work that way. (Especially since inserting a single element into a 256b vector is non-trivial, unlike for 128b vectors where you can use INSERTPS). – Peter Cordes Nov 25 '16 at 13:26
  • If I understand the question, a simple call to `memmove()` (which allows overlapping parameters) should do the job, similar to: memmove( array[1], array[0], sizeof(array)-sizeof(array[0]);` – user3629249 Nov 26 '16 at 23:43

1 Answers1

8

For AVX2:

Use VPERMPS to do it in one lane-crossing shuffle instruction.

rotated_right = _mm256_permutevar8x32_ps(src, _mm256_set_epi32(0,7,6,5,4,3,2,1));

For AVX (without AVX2)

Since you say the data is coming from memory already, this might be good:

  • use an unaligned load to get the 7 elements to the right place, solving all the lane-crossing problems.
  • Then blend the element that wrapped around into that vector of the other 7.
  • To get the element that wrapped in-place for the blend, maybe use a broadcast-load to get it to the high position. AVX can broadcast-load in one VBROADCASTPS instruction (so set1() is cheap), although it does need the shuffle port on Intel SnB and IvB (the only two Intel microarchitectures with AVX but not AVX2). (See perf links in the tag wiki.

INSERTPS only work on XMM destinations, and can't reach the upper lane.

You could maybe use VINSERTF128 to do another unaligned load that ends up putting the element you want as the high element in the upper lane (with some don't-care vector in the low lane).

This compiles, but isn't tested.

__m256 load_rotr(float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);
    return _mm256_blend_ps(shifted, bcast, 0b10000000);
#endif
}

See the code + asm on Godbolt

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 3
    Thanks again for the help! I tried out the AVX2 method and it worked fantastically running locally on my own machine. Unfortunately, I realised that my distribution environment only had regular AVX. The other method using loadu worked, however it was around 30% slower than what I was already doing when implemented. Though I have learnt a lot from your reply, so thank you for your time :)! – MishMash95 Nov 25 '16 at 14:24