2

I have some code that operates on 4D vectors and I'm currently trying to convert it to use SSE. I'm using both clang and gcc on 64-bit Linux.
Operating only on vectors is all fine - I've grasped that. But now comes a part where I have to multiply an entire vector by a single constant - something like this:

float x[4], y[4];
float a1 = 25.0f/216.0f;

for (int j = 0; j < 4; j++) {
    y[j] = a1 * x[j];
}

to something like this:

float4 y;
float a1 = 25.0f/216.0f;

y = a1 * x;  

where:

typedef float v4sf __attribute__ ((vector_size(4*sizeof(float))));

typedef union float4{
    v4sf v;
    struct { float x, y, z, w; };
} float4;

This of course will not work because I'm trying to do a multiplication of incompatible data types.
Now, I could do something like
`v4sf a1 = {25.0f/216.0f, 25.0f/216.0f, 25.0f/216.0f, 25.0f/216.0f};` but that just makes me feel silly, even if I write a macro to do it. Also, I'm pretty sure that will not result in very efficient code.

Googling this brought no clear answers ( see Load constant floats into SSE registers).

So what is the best way to multiply an entire vector by the same constant?

Emanuel Ey

3 Answers

10

Just use intrinsics and let the compiler take care of it, e.g.

__m128 vb = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); // vb = { 1.0, 2.0, 3.0, 4.0 }
__m128 va = _mm_set1_ps(25.0f / 216.0f); // va = { 25.0f / 216.0f, 25.0f / 216.0f, 25.0f / 216.0f, 25.0f / 216.0f }
__m128 vc = _mm_mul_ps(va, vb); // vc = va * vb

If you look at the generated code it should be quite efficient - the 25.0f / 216.0f value will be calculated at compile time and `_mm_set1_ps` usually generates reasonably efficient code for splatting a vector.

Note also that you normally only initialise a constant vector such as va just once, prior to entering a loop where you will be doing most of the actual work, so it tends not to be performance-critical.

Paul R
  • So how do I access individual elements of `__m128`? Shall I use a union like with `v4sf`? – Emanuel Ey Mar 11 '11 at 14:28
  • 1
    @Emanuel: You normally don't want to do this - the whole point of SIMD is to access large numbers of contiguous elements sequentially using e.g. `_mm_load_ps`, `_mm_store_ps` plus whatever logical/arithmetic SIMD operations you need in a loop. But in rare cases where you need to manipulate individual elements then yes, use a union. – Paul R Mar 11 '11 at 14:40
  • @Paul R: ok, I asked because I need test values of individual elements after the calculation has been done. Thanks. – Emanuel Ey Mar 11 '11 at 14:48
  • 1
    @Emanuel: try and do this using SIMD operations rather than extracting scalar values and testing these, otherwise you may have a performance bottleneck. E.g. use a SIMD comparison and mask out the elements that you are not interested in. If you post your scalar code in a new question then I and others can perhaps guide you in vectorizing it. – Paul R Mar 11 '11 at 14:56
  • @Paul R:See also: [performance of intrinsic functions with sse](http://stackoverflow.com/questions/5276825/performance-of-intrinsic-functions-with-sse) – Emanuel Ey Mar 11 '11 at 18:27
  • 1
    @PaulR, I just realized you answered this question a while back. A lot has probably changed with GCC since then but I just posted what I think is a better answer with GCC today. – Z boson Dec 08 '13 at 20:42
  • @PaulR, in this context, what do you mean by "splatting a vector"? – Jedi Jan 26 '17 at 09:43
  • 1
    @Jedi: "splatting" in this context means setting all elements of a vector to the same value, usually by copying one element to all the other elements. The terminology comes from AltiVec (PowerPC SIMD architecture) which has a `vec_splat` instruction/intrinsic. – Paul R Jan 26 '17 at 10:31
5

There is no reason one should have to use intrinsics for this. The OP just wants to do a broadcast. That's as basic a SIMD operation as SIMD addition. Any decent SIMD library/extension has to support broadcasts. Agner Fog's vector class certainly does, OpenCL does, and the GCC documentation clearly shows that it does:

a = b + 1;    /* a = b + {1,1,1,1}; */
a = 2 * b;    /* a = {2,2,2,2} * b; */

The following code compiles just fine

#include <stdio.h>
int main() {     
    typedef float float4 __attribute__ ((vector_size (16)));

    float4 x = {1,2,3,4};
    float4 y = (25.0f/216.0f)*x;
    printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
    //0.115741 0.231481 0.347222 0.462963
}

You can see the results at http://coliru.stacked-crooked.com/a/de79cca2fb5d4b11

Compare that code to the intrinsic version and it's clear which one is more readable. Not only is it more readable, it's also easier to port to e.g. ARM Neon. It also looks very similar to OpenCL C code.

Z boson
  • 1
    On the other hand, for portability to other compilers, e.g. ICC, MSVC, older versions of gcc, etc, intrinsics are a better bet, even if they are less readable - the choice depends on your particular application and portability requirements. – Paul R Dec 08 '13 at 21:45
  • Well Clang supports the GCC extensions (among others). But I think the best bet is to use Agner Fog's vector class. It works with all those compilers. Then you get the cleaner code as well. It's too bad C/C++ has not adopted simd types (e.g. float4 in OpenCL) as a basic type by now. It should be a basic type just like float and double since most hardware supports it now as a basic type. – Z boson Dec 08 '13 at 22:03
  • 2
    Apart from compiler support, you also need to consider that there are many important intrinsics which do not map well to a more generic model, e.g. `_mm_madd_epi16`. If you're doing relatively simple straightforward stuff, especially if it's just float/double, then some kind of abstraction layer or compiler extension soon might be a good way to go, but it's not a panacea - there are many use cases where intrinsics (or even asm) are a better choice. – Paul R Dec 08 '13 at 22:07
  • Yes, I totally agree on that. Although, so far there have been very few cases where the Vector Class could not do what I wanted but `mm_madd_ep16` is one of them. – Z boson Dec 08 '13 at 22:08
  • 1
    [Agner Fog's VCL](http://agner.org/optimize/#vectorclass) is compatible with Intel intrinsics with no extra casts, just use `_mm_madd_epi16` on a `Vec8s` and assign the result to another `Vec8s`; it has operator overloads and copy constructors to make this work without syntax pain. You might need an explicit cast to `__m128i` when using GNU C native vectors, but `__m128i` is defined in terms of a native vector (of `long long`) so it should be possible. – Peter Cordes Apr 16 '18 at 19:56
  • @PeterCordes, I'm aware of that and have used it but the VCL is limited to x86 which is one reason I like the vector extensions. – Z boson Apr 17 '18 at 07:08
  • Yeah, if you're doing something that doesn't need any shuffles or ISA-specific manual vectorization, you can get good results from GNU C vector extensions. I guess you could use them to unroll with multiple vector accumulators when it's hard to get auto-vectorization to do that, or maybe smart unaligned loads for first / last vectors instead of auto-vectorization stupid scalar prologue/epilogue. – Peter Cordes Apr 17 '18 at 07:10
1

This might not be the best way, but it's the approach I took when I was dabbling in SSE.

float4 scale(const float s, const float4 a)
{
  v4sf sv = { s, s, s, s };  /* broadcast s into all four lanes */
  float4 r = { .v = __builtin_ia32_mulps(sv, a.v) };
  return r;
}

float4 y;
float a1 = 25.0f/216.0f;

y = scale(a1, y);