6

The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly straightforward to analyze the code automatically and compute the independent multiplications/additions in parallel, but I've read that auto-vectorization only works for loops.

I've read multiple times now that accessing single elements of a vector, via a union or some other way, should be avoided at all costs and replaced by a _mm_shuffle_pd (I'm working on doubles only)...

I can't figure out how to store the contents of a __m128d vector as doubles without accessing it through a union. Also, does an operation like this give any performance gain compared to scalar code?

union {
  __m128d v;
  double d[2];
} vec;
union {
  __m128d v;
  double d[2];
} vec2;

vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[vec.d[1]], bvec[vec2.d[1]]));

Also, the two unions look ridiculously ugly. But when using

union dvec {
  __m128d v;
  double d[2];
} vec;

and trying to declare the indexX variables as dvec, the compiler complains that dvec is undeclared.

the_toast
  • 175
  • 1
  • 7
  • Have you looked into OpenCL? Both AMD's and Intel's OpenCL implementations can emit code for the x86 CPU and support double precision floating point, and use the SSE registers. I've never done it, but I suspect you could write a trivial OpenCL program that evaluates your expression and you could look at the generated code and see how their tools would do it. – K. Brafford Sep 19 '12 at 13:43

3 Answers

7

Unfortunately, if you look at MSDN, it says the following:

You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.

I'm no expert in SIMD, but this suggests that what you're doing won't work, as the type just isn't designed for it.

EDIT:

I've just found this, and it says:

Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".

It also says:

Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.

So maybe you can use them, but only in unions. That seems to contradict what MSDN says, however.

EDIT2:

Here is another interesting resource that describes, with examples, how to use these SIMD types.

In that link, you'll find this example:

#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
    return x[0];
}

In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.

Tony The Lion
  • 61,704
  • 67
  • 242
  • 415
  • Hey Tony, thanks for your answer! I read here (http://stackoverflow.com/questions/1771945/c-how-to-access-elements-of-vector-using-gcc-sse-vector-extension) that this is the "recommended" way, but I'm concerned that accessing elements this way might be slower than necessary. – the_toast Sep 19 '12 at 13:31
  • @the_toast I'm afraid that's probably the only way to do this. SIMD is already intrinsic instructions, so you're fairly limited as to what you can do. Read some of the stuff I've linked to, it may help make sense of it. :) – Tony The Lion Sep 19 '12 at 13:32
  • @the_toast I stand corrected, with a GCC extension (see my edit) you may have another option. – Tony The Lion Sep 19 '12 at 13:35
  • 1
    @the_toast It might be useful to know that support for directly accessing SIMD elements was left out of the SSE intrinsic interface intentionally because (prior to SSE4.1) there is no efficient way to do it in the hardware. You're "supposed" to do it using unpack and move instructions - type-punning will often cause expensive store-to-load stalls due to immediate accessing of memory from a different word size it was written to. – Mysticial Sep 19 '12 at 14:43
1

_mm_cvtsd_f64 + _mm_unpackhi_pd

For doubles:

#include <assert.h>

#include <x86intrin.h>

int main(void) {
    __m128d x = _mm_set_pd(1.5, 2.5);
    /* _mm_cvtsd_f64 + _mm_unpackhi_pd */
    assert(_mm_cvtsd_f64(x) == 2.5);
    assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}

For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function

  • _mm_cvtss_f32 + _mm_shuffle_ps
  • _MM_EXTRACT_FLOAT

For ints you can use _mm_extract_epi32 (SSE4.1):

#include <assert.h>

#include <x86intrin.h>

int main(void) {
    __m128i x = _mm_set_epi32(1, 2, 3, 4); /* _mm_set_epi32(e3, e2, e1, e0) */
    assert(_mm_extract_epi32(x, 3) == 1);
    assert(_mm_extract_epi32(x, 2) == 2);
    assert(_mm_extract_epi32(x, 1) == 3);
    assert(_mm_extract_epi32(x, 0) == 4);
}

GitHub upstream.

Compile and run examples with:

gcc -ggdb3 -O0 -std=c99 -msse4.1 -Wall -Wextra -pedantic -o main.out main.c
./main.out

Tested on Ubuntu 19.04 amd64.

Ciro Santilli OurBigBook.com
  • 347,512
  • 102
  • 1,199
  • 985
  • For storing the high half, there's [`movhpd`](https://www.felixcloutier.com/x86/movhpd) (or equivalent but smaller machine code `movhps`). `_mm_storeh_pd (double *p, __m128d a)`. But if you want to do more with the scalar double like compare it, not just store it, then yes shuffle with `unpckhpd`. – Peter Cordes Feb 19 '22 at 00:42
0

There is a double _mm_cvtsd_f64 (__m128d a) function defined in "emmintrin.h" to access the lower double of an SSE vector of two doubles.

From the Intel Intrinsics guide:

Synopsis

  • double _mm_cvtsd_f64 (__m128d a)
  • #include <emmintrin.h>
  • Instruction: movsd
  • CPUID Feature Flag: SSE2

Description: Copy the lower double-precision (64-bit) floating-point element of a to dst.

Operation: dst[63:0] := a[63:0]