Load and store complex floats with Intel intrinsics in C

Question

I'm trying to do some calculations with complex floating point numbers, using __m128 vectors. A __m128 can hold two complex floats, since each complex number consists of two floating point numbers: one real part and one imaginary part.

So far, so good.

My problem arises when I must "collect" my answers into one complex float. Say I have two __m128 vectors, with four complex numbers stored in these two vectors. As an example, I can add the two vectors together (two complex numbers each) using the _mm_add_ps intrinsic, but how do I "reduce" the two complex numbers in the result vector to one complex number (two floats) and store it in an array?

And similarly, if I want to grab a complex number from my array and store it twice inside a vector (the real part in the 1st and 3rd block, and the imaginary part in the 2nd and 4th block), how can I accomplish this?

score 0 · Answer 1 · edited May 23 '17 at 12:16

Don't store your complex numbers in interleaved/packed format in the first place, if you want to use SIMD on them. Store the real parts and imaginary parts in separate arrays, so you can do four complex multiplies in parallel without any shuffling (or slow horizontal operations like HSUBPS).
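As a rough illustration of the split layout, here is a minimal sketch of multiplying four complex numbers per iteration with plain SSE. The function name cmul_soa and the array names are made up for this example, and it assumes n is a multiple of 4:

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps, ... */

/* Hypothetical helper: out = a * b for n complex numbers stored as
   separate real/imaginary arrays. Assumes n is a multiple of 4. */
static void cmul_soa(const float *a_re, const float *a_im,
                     const float *b_re, const float *b_im,
                     float *out_re, float *out_im, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 ar = _mm_loadu_ps(a_re + i);
        __m128 ai = _mm_loadu_ps(a_im + i);
        __m128 br = _mm_loadu_ps(b_re + i);
        __m128 bi = _mm_loadu_ps(b_im + i);

        /* (ar + i*ai) * (br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
        __m128 re = _mm_sub_ps(_mm_mul_ps(ar, br), _mm_mul_ps(ai, bi));
        __m128 im = _mm_add_ps(_mm_mul_ps(ar, bi), _mm_mul_ps(ai, br));

        _mm_storeu_ps(out_re + i, re);
        _mm_storeu_ps(out_im + i, im);
    }
}
```

Every lane does the same work, so no shuffles or horizontal instructions are needed.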

To directly answer the question: do the first stage of a horizontal sum. Bring the high 64 bits down to the low 64 bits of another vector (with _mm_movehl_ps), and then _mm_add_ps, as shown in my answer on that question.

Then you can MOVLPS to store the low 2 floats: void _mm_storel_pi (__m64 *p, __m128 a). It looks like you'll need annoying casting to use it :/ MOVSD would also work, but takes one more byte to encode.
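Putting those two steps together, a minimal sketch might look like this. The helper name store_complex_sum is invented for the example, and it assumes the vector is laid out as [re0, im0, re1, im1]:

```c
#include <xmmintrin.h>  /* SSE: _mm_movehl_ps, _mm_add_ps, _mm_storel_pi */

/* Hypothetical helper: v holds [re0, im0, re1, im1].
   Writes (re0 + re1, im0 + im1) to out[0] and out[1]. */
static void store_complex_sum(float *out, __m128 v)
{
    __m128 hi  = _mm_movehl_ps(v, v);  /* [re1, im1, re1, im1] */
    __m128 sum = _mm_add_ps(v, hi);    /* low half = re0+re1, im0+im1 */
    _mm_storel_pi((__m64 *)out, sum);  /* MOVLPS: store the low 2 floats */
}
```

The (__m64 *) cast is the annoying casting mentioned above; the upper half of sum is simply ignored.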

"And similarly, if I want to grab a complex number from my array and store it twice inside a vector"

Use MOVDDUP to broadcast 64 bits from memory or another register. You'll need some casting to use the intrinsics, but that's fine (they won't compile to any instructions, and using double instructions like MOVDDUP on float data has no penalty on any existing CPUs):

__m128d _mm_loaddup_pd(double const * dp);
__m128d _mm_movedup_pd(__m128d a);

At least it has a load intrinsic, unlike PMOVZX (this design flaw is one of my major pet peeves with intrinsics).
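For illustration, here is one way the broadcast could be wrapped up. The helper name broadcast_complex and the SSE3_TARGET macro are invented for this sketch. Instead of casting the float pointer to double const * directly, it copies the 8 bytes into a double with memcpy first, which is an assumption made here to stay clear of strict-aliasing questions; compilers normally optimize the copy away:

```c
#include <string.h>     /* memcpy */
#include <pmmintrin.h>  /* SSE3: _mm_loaddup_pd */

/* Assumption for this sketch: let GCC/clang build the helper even
   when -msse3 is not passed on the command line. */
#if defined(__GNUC__) && !defined(__SSE3__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

/* Hypothetical helper: z points at one complex float {re, im}.
   Returns [re, im, re, im]. */
SSE3_TARGET static __m128 broadcast_complex(const float *z)
{
    double bits;                          /* 64 bits = two packed floats */
    memcpy(&bits, z, sizeof bits);        /* copy raw bytes, no conversion */
    __m128d dup = _mm_loaddup_pd(&bits);  /* MOVDDUP: duplicate the 64 bits */
    return _mm_castpd_ps(dup);            /* cast only, no instruction */
}
```

The result can be fed straight into _mm_mul_ps or _mm_add_ps against a vector holding two different complex numbers.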
