
The _mm_set_epi64 and similar *_epi64 intrinsics seem to use and depend on __m64 types. I want to initialize a variable of type __m128 such that the upper 64 bits of it are 0, and the lower 64 bits of it are set to x, where x is of type uint64_t (or similar unsigned 64-bit type). What's the "right" way of doing so?

Preferably, this should be done in a compiler-independent manner.

Gideon

3 Answers


To answer your question about how to load a 64-bit value into the lower 64 bits of an XMM register while zeroing the upper 64 bits: _mm_loadl_epi64((const __m128i*)&x) will do exactly what you want. (The intrinsic takes a __m128i const*, so the address of x needs a cast.)
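
As a minimal sketch of that load (the helper names `load_low64_zero_high` and `lanes` are made up for illustration and are not part of any library), assuming an SSE2-capable compiler:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: returns a vector whose bits 0..63 hold x and
// whose bits 64..127 are zero. _mm_loadl_epi64 reads only 64 bits and
// has no alignment requirement on the source address.
inline __m128i load_low64_zero_high(std::uint64_t x) {
    return _mm_loadl_epi64(reinterpret_cast<const __m128i*>(&x));
}

// Hypothetical helper, for inspection only: copies both 64-bit lanes
// out of the vector (out[0] is the low lane, out[1] the high lane).
inline void lanes(__m128i v, std::uint64_t out[2]) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
```

Whether the compiler emits an actual memory load or a register move here depends on the compiler and optimization level.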

In regards to _mm_set_epi64: I once said that looking at the source code of Agner Fog's Vector Class Library can answer 95% of the questions on SSE/AVX on SO. Agner implemented this (in the file vectori128.h) for multiple compilers and for both 64-bit and 32-bit modes. Note that for the MSVC 32-bit solution, Agner says "this is inefficient, but other solutions are worse". I guess that's what Mysticial means by "There isn't a good way to do it.".

Vec2q(int64_t i0, int64_t i1) {
#if defined (_MSC_VER) && ! defined(__INTEL_COMPILER)
        // MS compiler has no _mm_set_epi64x in 32 bit mode
#if defined(__x86_64__)                                    // 64 bit mode
#if _MSC_VER < 1700
        __m128i x0 = _mm_cvtsi64_si128(i0);                // 64 bit load
        __m128i x1 = _mm_cvtsi64_si128(i1);                // 64 bit load
        xmm = _mm_unpacklo_epi64(x0,x1);                   // combine
#else
        xmm = _mm_set_epi64x(i1, i0);
#endif
#else   // MS compiler in 32-bit mode
        union {
            int64_t q[2];
            int32_t r[4];
        } u;
        u.q[0] = i0;  u.q[1] = i1;
        // this is inefficient, but other solutions are worse
        xmm = _mm_setr_epi32(u.r[0], u.r[1], u.r[2], u.r[3]);
#endif  // __x86_64__
#else   // Other compilers
        xmm = _mm_set_epi64x(i1, i0);
#endif
};
Z boson

The most common "standard" intrinsic for this is _mm_set_epi64x.
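
For the case in the question (upper half zero, lower half x), a minimal sketch could look like the following. The wrapper name `low64_from_u64` is made up for illustration, and the sketch assumes a compiler that provides _mm_set_epi64x:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical wrapper: upper 64 bits = 0, lower 64 bits = x.
// _mm_set_epi64x takes its arguments from high to low: (e1, e0).
inline __m128i low64_from_u64(std::uint64_t x) {
    // The intrinsic takes signed 64-bit integers; the cast keeps the bit pattern.
    return _mm_set_epi64x(0, static_cast<long long>(x));
}
```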

For platforms that lack _mm_set_epi64x you can define a replacement macro like this:

#define _mm_set_epi64x(m0, m1) _mm_set_epi64(_m_from_int64(m0), _m_from_int64(m1))
Paul R
  • And for anyone who cares about 32-bit anymore, this intrinsic only exists on x64. To target 32-bit, a different approach will be needed. – Mysticial May 05 '14 at 19:34
  • @Mysticial If you don't mind adding an answer on how to do that, that might be helpful for someone coming after me. – Gideon May 05 '14 at 19:41
  • @Gideon There *isn't* a good way to do it. It's better to avoid these set intrinsics in the first place. The only place I find them acceptable would be compile-time constants - in which case, you just manually split the 64-bit integer into its halves and use `_mm_set_epi32()`. – Mysticial May 05 '14 at 19:44
  • @Gideon, I posted an answer to your question. – Z boson May 06 '14 at 14:03
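
The manual split into 32-bit halves mentioned in the comments can be sketched roughly as follows. The function name `set_epi64x_from_halves` is invented for this example. The comment recommends this approach for compile-time constants; nothing here is claimed to be efficient for runtime values.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical helper: builds a 128-bit vector from two 64-bit values
// using only _mm_set_epi32, which is also available in 32-bit builds.
inline __m128i set_epi64x_from_halves(std::uint64_t hi, std::uint64_t lo) {
    // Truncate to 32 bits first, then convert to int. For values above
    // INT_MAX this conversion is implementation-defined before C++20,
    // but it wraps on the mainstream x86 compilers.
    const int hi_hi = static_cast<int>(static_cast<std::uint32_t>(hi >> 32));
    const int hi_lo = static_cast<int>(static_cast<std::uint32_t>(hi));
    const int lo_hi = static_cast<int>(static_cast<std::uint32_t>(lo >> 32));
    const int lo_lo = static_cast<int>(static_cast<std::uint32_t>(lo));
    // _mm_set_epi32 takes its arguments from the highest element to the lowest.
    return _mm_set_epi32(hi_hi, hi_lo, lo_hi, lo_lo);
}
```

Calling `set_epi64x_from_halves(0, x)` gives the layout asked for in the question.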

I want to initialize a variable of type __m128 ... where x is of type uint64_t

The intrinsic which takes the uint64_t is _mm_set_epi64x (as opposed to _mm_set_epi64, which takes a __m64).

I recently ran into the issue on Solaris. Sun Studio 12.3 and earlier lack _mm_set_epi64x. They also lack the workarounds, like _mm_cvtsi64_si128 and _m_from_int64.

Here's the hack I used, if interested. The other option was to disable SSE2, which was not too appealing (and it was 3x slower in benchmarks):

// Sun Studio 12.3 and earlier lack SSE2's _mm_set_epi64 and _mm_set_epi64x.
#if defined(__SUNPRO_CC) && (__SUNPRO_CC < 0x5130)
inline __m128i _mm_set_epi64x(const uint64_t a, const uint64_t b)
{
    union INT_128_64 {
        __m128i   v128;
        uint64_t  v64[2];
    };

    INT_128_64 v;
    v.v64[0] = b; v.v64[1] = a; 
    return v.v128;
}
#endif

I believe C++11 could do additional things to help the compiler and performance, like initialize a constant array:

const INT_128_64 v = {a,b};
return v.v128;

There's a big caveat: I believe there is undefined behavior, because a write occurs using the v64 member of the union and then a read occurs using the v128 member. Testing under SunCC shows the compiler is doing the expected (but technically incorrect) thing.

I believe you can sidestep the undefined behavior using a memcpy, but that could crush performance. Also see Peter Cordes' answer and discussion at How to swap two __m128i variables in C++03 given its an opaque type and an array?.

The following may also be a good choice to avoid the undefined behavior from using the inactive union member. But I'm not sure about the punning.

INT_128_64 v;
v.v64[0] = b; v.v64[1] = a;
return *(reinterpret_cast<__m128i*>(v.v64));

EDIT (three months later): Solaris and SunCC did not like the punning. It produced bad code for us, and we had to memcpy the value into __m128i. Unix, Linux, Windows, GCC, Clang, ICC, MSC were all OK. Only SunCC gave us trouble.
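
A minimal sketch of the memcpy variant mentioned above, with an invented function name (`set_epi64x_memcpy`). It is an assumption that a given compiler optimizes the copy away; that is worth checking in the generated code.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstring>      // std::memcpy

// Hypothetical replacement for _mm_set_epi64x(hi, lo) that avoids both
// union punning and pointer casts by copying the bytes.
inline __m128i set_epi64x_memcpy(std::uint64_t hi, std::uint64_t lo) {
    static_assert(sizeof(__m128i) == 2 * sizeof(std::uint64_t),
                  "__m128i is expected to be 16 bytes");
    const std::uint64_t src[2] = { lo, hi };  // element 0 is the low lane on x86
    __m128i v;
    std::memcpy(&v, src, sizeof v);
    return v;
}
```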

jww
  • Type-punning with unions is preferable to pointer-casts. They're both undefined behaviour according to the standard, but the union is safe with gcc at least. The pointer-cast technique isn't safe with real compilers. (Except maybe with SIMD `__m128` types, which are defined with a may_alias attribute or something like that. Hopefully SunCC defines it similarly.) – Peter Cordes Jul 24 '16 at 03:51
  • It's worth trying memcpy if you're not sure that union-based type-punning is safe. Some compilers are good about optimizing it away, but as you say, I showed that not all make acceptable code. `memcpy` is AFAIK the only type-punning technique that is guaranteed to be portable by ISO C and C++, but union-based type-punning is widely used in real life as I understand it. – Peter Cordes Jul 24 '16 at 03:53
  • @PeterCordes - Looking at the SSE2 [`_mm_loadl_pi`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_loadl&expand=3079), it may be a suitable replacement. It allows the load of an unaligned 64-bit value. Using `_mm_loadl_pi` twice with an intermediate shift may avoid some of the theoretical problems. – jww Jul 24 '16 at 04:24
  • If you actually want the compiler to emit load instructions, instead of `movd xmm0, eax` or `movq xmm0, rax` or something, then maybe. But you should definitely not use it twice! The intrinsics that look to most closely match the asm you'd want are `__m128i _mm_loadu_si64 (void const* mem_addr)` (to load the low 64 bits with `movq xmm0, m64`, zeroing the upper 64 of the xmm). Or on compilers which don't support it, `_mm_loadl_epi64` which also says it compiles to a `movq` load. Then `_mm_loadh_pi` to load the upper half with `movhps`. Using a shift intrinsic would be silly. – Peter Cordes Jul 24 '16 at 04:45
  • Anyway, `movq` / `movl/h` could be good for a pair of non-adjacent 64-bit values. BTW, no instructions have alignment requirements for operands of 64-bit or smaller; you don't need a special intrinsic for that. It might also be OK to avoid a store-forwarding-failure stall on a pair of adjacent recently-written 64-bit values, instead using `_mm_loadu_si128`. (But pointer-casting for `_mm_loadu` should be aliasing-safe. Possibly your cast version is actually safe, but I think I've seen an SO question where something like that didn't do what the OP wanted.) – Peter Cordes Jul 24 '16 at 04:50
  • update on my earlier comments: Union-based type punning is guaranteed to work in C99/C11, but *not* C++. Casting `*(__m128i*)&v64[0]` should actually be safe, because __m128 types are special and are allowed to alias, unlike say `*(double *)&v64[0]`. For example, gcc defines __m128i with `__attribute__((may_alias))`, unlike the internal v2qi native-vector type it's based on. I've never liked `memcpy` for type punning, but maybe compilers are good enough at seeing through it that you don't get bad code in practice. I forget if I've ever seen an example of it not optimizing away. – Peter Cordes Oct 04 '16 at 13:42
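
The two-load idea from the comments above (`_mm_loadl_epi64` for the low half, then `_mm_loadh_pi` for the high half) can be sketched as below for a pair of non-adjacent 64-bit values. The function name `load_two_u64` is made up for this example, and which instructions a compiler actually emits for it is not guaranteed.

```cpp
#include <emmintrin.h>  // SSE2; also pulls in the SSE header that declares _mm_loadh_pi
#include <cstdint>

// Hypothetical helper: low lane from *lo, high lane from *hi.
inline __m128i load_two_u64(const std::uint64_t* lo, const std::uint64_t* hi) {
    // Load 64 bits into the low half and zero the high half.
    const __m128i low = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(lo));
    // _mm_loadh_pi works on __m128 (float) vectors, so cast the register
    // type; the casts only reinterpret bits, they do not convert values.
    const __m128 both = _mm_loadh_pi(_mm_castsi128_ps(low),
                                     reinterpret_cast<const __m64*>(hi));
    return _mm_castps_si128(both);
}
```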