
I am trying to shift the elements of a packed single-precision vector right by one position using AVX2 intrinsics in C++, and I cannot get it to work.

float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
auto vec = _mm256_load_ps(data);
auto vec2 = foo(vec); // use avx intrinsics to implement foo
_mm256_store_ps(data, vec2);

After doing this I would like data to contain the values

{X, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f}

where X is an arbitrary value. I do not care whether it comes from a circular shift, zero-padding, or undefined padding, as long as the operation is fast.

Can someone help me implement foo efficiently using AVX2 intrinsics?

DNF

1 Answer


Use _mm256_loadu_ps and _mm256_storeu_ps unless you have explicitly declared the float array 'data' to be 32-byte aligned (for example with alignas(32)); the aligned versions, _mm256_load_ps and _mm256_store_ps, may fault on an unaligned address. You can use _mm256_permutevar8x32_ps() to rotate the data right by one 32-bit element (4 bytes). It picks each output element from any of the eight input elements, so it works across the two 128-bit lanes. The Intel Intrinsics Guide at https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=AVX_ALL&ig_expand=6144,4986 is a useful reference for SIMD intrinsics. I think something like the following should do the trick pretty efficiently. I'm still on old hardware that doesn't have AVX2, so I can't test this, but the idea is valid even if the exact code isn't. :D

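// Output element i is taken from input element idx[i]:
// element 0 wraps around from element 7, the others move up by one.
__m256i idx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
vec2 = _mm256_permutevar8x32_ps(vec, idx);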
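As a minimal sketch, the two lines above could be wrapped into the foo from the question roughly as follows. The AVX2_TARGET macro and the rotate_right_one helper are hypothetical names introduced only for illustration. The macro assumes GCC or Clang and lets the functions build without -mavx2; the CPU still has to support AVX2 at run time.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: GCC/Clang function-level target attribute. Other compilers
// are expected to enable AVX2 through their own build flags.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

// Rotate the eight floats right by one element: out[i] = in[idx[i]].
AVX2_TARGET __m256 foo(__m256 vec) {
    const __m256i idx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    return _mm256_permutevar8x32_ps(vec, idx);
}

// Hypothetical helper: apply foo in place to eight floats.
// Unaligned load/store is used, so `data` needs no special alignment.
AVX2_TARGET void rotate_right_one(float* data) {
    assert(data != nullptr);
    const __m256 vec = _mm256_loadu_ps(data);
    _mm256_storeu_ps(data, foo(vec));
}
```

Calling rotate_right_one(data) on the array from the question should leave the old last element in data[0] and the first seven elements moved up by one position.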
Simon Goater
  • Thanks. That's great! Do you think if I use `idx = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);` it will repeat the first value? I cannot test this right now. – DNF May 30 '23 at 16:40
  • From what I can tell from the description of the intrinsic, you can put any of 0-7 in any position so yes. The numbers for idx might be wrong. If the function in the other answer is correct it probably should be (1, 2, 3, 4, 5, 6, 7, 0). You'll need to try it out to be sure. Please let me know which works. – Simon Goater May 30 '23 at 16:47
  • The answer is correct and works as advertised. Also, `idx = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);` works well. – DNF May 30 '23 at 19:16
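The index variant confirmed in the comments can be sketched next to a zero-padding variant, which the question also lists as acceptable. The function names and the AVX2_TARGET macro are hypothetical, and the zero-fill version using _mm256_blend_ps is an assumption that was not tested in this thread.

```cpp
#include <cassert>
#include <immintrin.h>

// Assumption: GCC/Clang function-level target attribute (see build flags otherwise).
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

// Shift right by one element and repeat the first value in slot 0,
// using the index vector from the comments: (0, 0, 1, 2, 3, 4, 5, 6).
AVX2_TARGET void shift_right_repeat_first(float* data) {
    assert(data != nullptr);
    const __m256i idx = _mm256_setr_epi32(0, 0, 1, 2, 3, 4, 5, 6);
    const __m256 vec = _mm256_loadu_ps(data);
    _mm256_storeu_ps(data, _mm256_permutevar8x32_ps(vec, idx));
}

// Shift right by one element and put 0.0f in slot 0.
// The permute rotates; the blend then replaces element 0 (mask bit 0) with zero.
AVX2_TARGET void shift_right_zero_fill(float* data) {
    assert(data != nullptr);
    const __m256i idx = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    const __m256 rotated = _mm256_permutevar8x32_ps(_mm256_loadu_ps(data), idx);
    _mm256_storeu_ps(data, _mm256_blend_ps(rotated, _mm256_setzero_ps(), 0x01));
}
```

The repeat-first version costs a single permute, the same as the rotation. The zero-fill version adds one blend, so it is only worth using when slot 0 really has to be zero.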