I am trying to shift the elements of a packed single-precision vector (__m256) one lane to the right using AVX2 intrinsics in C++, but I cannot get it to work.
// _mm256_load_ps requires a 32-byte aligned address
alignas(32) float data[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
auto vec = _mm256_load_ps(data);
auto vec2 = foo(vec); // use avx intrinsics to implement foo
_mm256_store_ps(data, vec2);
After doing this, I would like data to contain the values
{X, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f}
where X is an arbitrary value. I do not care whether it corresponds to a circular shift, zero padding, or some undefined padding value, as long as it is fast.
Can someone help me implement foo efficiently using AVX2 intrinsics?