3

I am vectorizing code with AVX intrinsics and want to fill a __m256 vector with a constant float such as 1.0, so that one register holds the vector {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0}. Does anyone know how to do it?

This is similar to the question "constant float with SIMD", but I am using AVX, not SSE.

Lbj_x

3 Answers

7

The simplest method is to use _mm256_set1_ps:

__m256 v = _mm256_set1_ps(1.0f);
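
To confirm that every lane holds the constant, the vector can be copied back to memory and inspected. The following is only a minimal sketch: the function name `fill_broadcast` and the `TARGET_AVX` macro are hypothetical helpers introduced for illustration (the macro assumes GCC or Clang and lets the function use AVX even when the file is not built with -mavx).

```cpp
#include <immintrin.h>  // AVX intrinsics

// Hypothetical helper macro (not part of any library): on GCC/Clang it
// enables AVX for a single function; other compilers get an empty macro.
#if defined(__GNUC__)
  #define TARGET_AVX __attribute__((target("avx")))
#else
  #define TARGET_AVX
#endif

// Broadcast `value` into all 8 lanes, then copy the lanes to `out`.
TARGET_AVX void fill_broadcast(float value, float out[8]) {
  __m256 v = _mm256_set1_ps(value);  // {value, value, ..., value}
  _mm256_storeu_ps(out, v);          // unaligned store: `out` needs no special alignment
}
```

Calling `fill_broadcast(1.0f, out)` on a CPU that supports AVX should leave eight copies of 1.0f in `out`.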
Paul R
  • Note that the compiler *could* use the integer sequence from Lu'u's answer if it wanted to (and you were compiling with AVX2). It could also compile a `set1` to a broadcast instruction, instead of a 256b load. If you really want to, you can make the decision for the compiler by using one of the other two answers (to force a load or on-the-fly generation). – Peter Cordes Jan 26 '16 at 13:02
  • Thank you @Paul R. In the end I chose to use your answer; however, nabla gave me a quick answer, so I accepted his. I hope you understand :D – Lbj_x Jan 26 '16 at 13:15
  • @Lbj_x: no problem - all the answers are valid and potentially useful for future readers. – Paul R Jan 26 '16 at 13:27
  • I think at some point some compilers converted `_mm256_set1_ps` to two instructions whereas all compilers use one instruction for `_mm256_broadcast_ps`. – Z boson Jan 26 '16 at 17:31
  • [here](http://stackoverflow.com/a/13222222/2542702) is where I read this. Maybe Clang has fixed that now. – Z boson Jan 27 '16 at 07:24
5

See here for the AVX intrinsics load and store operations. You simply need to declare a float array and an AVX vector __m256, and then use the appropriate operation to load the float array into the AVX vector.

In this case, the intrinsic _mm256_load_ps is what you want.

Update: As mentioned in the comments, the data must be 32-byte aligned. See Intel's data alignment documentation for a detailed explanation. I've made the solution code cleaner, as per Peter's comments. With optimisation enabled (-O3), this produces the same code as Paul's answer (also with optimisation enabled). Without optimisation, the number of instructions is the same, but all 8 floating-point values are stored, rather than the single floating-point value in Paul's answer.

Here is the modified example:

#include <immintrin.h> // For AVX instructions

#ifdef __GNUC__
  #define ALIGN(x) x __attribute__((aligned(32)))
#elif defined(_MSC_VER)
  #define ALIGN(x) __declspec(align(32)) x
#endif

static constexpr ALIGN(float a[8]) = {1.0f,1.0f,1.0f,1.0f,1.0f,1.0f,1.0f,1.0f};

int main() {
  // Load the float array into an avx vector
  __m256 vect = _mm256_load_ps(a);
}
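
As an alternative to the compiler-specific macro, the C++11 `alignas` specifier can request the 32-byte alignment that `_mm256_load_ps` requires. This is a hedged sketch rather than the answer's original code: `kOnes`, `load_ones` and the `TARGET_AVX` macro are hypothetical names introduced for illustration, and the macro assumes GCC or Clang.

```cpp
#include <immintrin.h>  // AVX intrinsics
#include <cstdint>      // std::uintptr_t, for checking the address

// Hypothetical helper macro (not part of any library): on GCC/Clang it
// enables AVX for a single function; other compilers get an empty macro.
#if defined(__GNUC__)
  #define TARGET_AVX __attribute__((target("avx")))
#else
  #define TARGET_AVX
#endif

// Static constant array, aligned to 32 bytes with standard C++11 syntax.
alignas(32) static const float kOnes[8] = {1.0f, 1.0f, 1.0f, 1.0f,
                                           1.0f, 1.0f, 1.0f, 1.0f};

// Load the aligned constant into a vector and copy it out for inspection.
TARGET_AVX void load_ones(float out[8]) {
  __m256 v = _mm256_load_ps(kOnes);  // aligned load: address must be a multiple of 32
  _mm256_storeu_ps(out, v);          // unaligned store: `out` needs no special alignment
}
```

If the source array cannot be guaranteed to be aligned, `_mm256_loadu_ps` performs the same load without the alignment requirement.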

You can easily check the assembly output with a few compilers by using the Godbolt interactive C++ compiler.

RobClucas
  • You need to align the array to a 32-byte boundary or the program will crash – phuclv Jan 26 '16 at 11:21
  • @Lbj_x: Paul R's answer is better: tell the compiler what you *really* want, rather than how to do it. It will use whatever tricks (like broadcast instructions) are available / applicable. **You absolutely don't want to force it into making a *local* array, rather than loading from a global/static constant.** Actually making the local array would mean storing to the stack. Also note that gcc compiling for win32 will choke on this code, because it's not Unix vs. Windows, it's GNU C (`__attribute__`) vs. MSVC (`__declspec`). – Peter Cordes Jan 26 '16 at 12:37
  • @PeterCordes, I agree that Paul R's answer is better, hence the upvote, and agree with your comment, especially the loading from a global/static constant. The use of Windows vs Unix was taken from the referenced intel alignment documentation. – RobClucas Jan 26 '16 at 12:44
  • @nabla: I might be mistaken then. IDK if mingw, or some other case of gcc compiling for win32 would define `_WIN32` without being able to compile `__declspec`, but I think it's likely. I'd probably do `#ifdef __GNUC__` (since all major compilers except MSVC support at least basic GNU C extensions, I think). I'd also avoid duplicating the array initializer (just ifdef the declaration, or a declare_aligned macro or something). – Peter Cordes Jan 26 '16 at 12:54
  • Also, depending on what the compiler does, this might be a better answer if you make the array static, instead of local, so you don't get the worst of both worlds. Then I'd remove my downvote. – Peter Cordes Jan 26 '16 at 12:55
  • @PeterCordes I agree, I wasn't trying to demonstrate optimal code, merely the illustration of using the instruction. I will edit just now. – RobClucas Jan 26 '16 at 12:57
  • change my vote to an upvote :) I put *this* code on godbolt: https://goo.gl/qIXGlK. I tossed in some other ways of doing it. The link in your question is to some other code on godbolt. gcc compiles them differently, but clang even sees through the static array and uses `vbroadcastss`. `main` is a poor choice for testing code on godbolt: gcc marks it as "cold" and optimizes less. Also, writing functions to take args and/or return a result is *easier* than worrying about `main` optimizing to `return 0;`. Remember, we just want to look at the asm, not run it. – Peter Cordes Jan 26 '16 at 15:50
  • @PeterCordes thanks, that's very interesting. I'll update answer with link to your Godbolt code. – RobClucas Jan 26 '16 at 15:59
5

You can generate the constant on the fly, without a constant array in memory:

pcmpeqw xmm0, xmm0 ; all bits set, then shift to get 1.0f
pslld   xmm0, 25
psrld   xmm0, 2

See the way to make other constants in Agner Fog's optimization guide, 13.10 Generating constants - Making floating point constants in XMM registers

pcmpeqw xmm0, xmm0 ; 1.5f
pslld   xmm0, 24
psrld   xmm0, 2

pcmpeqw xmm0, xmm0 ; -2.0f
pslld   xmm0, 30

See also

phuclv