3

What is the best practice for swapping __m128i variables?

The background is a compile error under Sun Studio 12.2, which is a C++03 compiler. __m128i is an opaque type used with MMX and SSE instructions, and its usually and unsigned long long[2]. C++03 does not provide the support for swapping arrays, and std:swap(__m128i a, __m128i b) fails under the compiler.


Here are some related questions that don't quite hit the mark. They don't apply because std::vector is not available.

Community
  • 1
  • 1
jww
  • 97,681
  • 90
  • 411
  • 885

2 Answers2

2

swap via memcpy?

#include <emmintrin.h>
#include <cstring>

template<class T>
void memswap(T& a, T& b)
{
    T t;
    std::memcpy(&t, &a, sizeof(t));
    std::memcpy(&a, &b, sizeof(t));
    std::memcpy(&b, &t, sizeof(t));
}

int main() {
    __m128i x;
    __m128i y;
    memswap(x, y);
    return 0;
}
Leon
  • 31,443
  • 4
  • 72
  • 97
  • 2
    If you're going to write a custom swap function, just use assignment by value because that's optimal for `__m128i` values. gcc does manage to optimize away the memcpy and keep values in registers (with a test function that takes two `__m128i` args and returns a `__m128i`), [but ICC13 doesn't](https://godbolt.org/g/q0xYxo). Simple assignment is much less likely to have a negative impact on optimization. ICC13 has no problem with it, for example. – Peter Cordes Jul 17 '16 at 22:28
  • Even gcc can trip up when using `memswap` on `__m128i`, if the values are both already in memory. See my answer. – Peter Cordes Jul 17 '16 at 23:05
2

This doesn't sound like a best-practices issue; it sounds like you need a workaround for a seriously broken implementation of intrinsics. If __m128i tmp = a; doesn't compile, that's pretty bad.


If you're going to write a custom swap function, keep it simple. __m128i is a POD type that fits in a single vector register. Don't do anything that will encourage the compiler to spill it to memory. Some compilers will generate really horrible code even for a trivial test-case, and even gcc/clang might trip over a memcpy as part of optimizing a big complicated function.

Since the compiler is choking on the constructor, just declare a tmp variable with a normal initializer, and use = assignment to do the copying. That always works efficiently in any compiler that supports __m128i, and is a common pattern.

Plain assignment to/from values in memory works like _mm_store_si128 / _mm_load_si128: i.e. movdqa aligned stores/loads that will fault if used on unaligned addresses. (Of course, optimization can result in loads getting folded into memory operands to another vector instruction, or stores not happening at all.)

// alternate names: assignment_swap
// or swap128, but then the name doesn't fit for __m256i...

// __m128i t(a) errors, so just use simple initializers / assignment
template<class T>
void vecswap(T& a, T& b) {
    // T t = a;     // Apparently SunCC even choked on this
    T t;
    t = a;
    a = b;
    b = t;
}

Test cases: optimal code even with a crusty compiler like ICC13 which does a terrible job with the memcpy version. asm output from the Godbolt compiler explorer, with icc13 -O3

__m128i test_return2nd(__m128i x, __m128i y) {
    vecswap(x, y);
    return x;
}

    movdqa    xmm0, xmm1
    ret                    # returning the 2nd arg, which was in xmm1


__m128i test_return1st(__m128i x, __m128i y) {
    vecswap(x, y);
    return y;
}

    ret                   # returning the first arg, already in xmm0

With memswap, you get something like

return1st_memcpy(__m128i, __m128i):        ## ICC13 -O3
    movdqa    XMMWORD PTR [-56+rsp], xmm0
    movdqa    XMMWORD PTR [-40+rsp], xmm1    # spill both
    movaps    xmm2, XMMWORD PTR [-56+rsp]    # reload x
    movaps    XMMWORD PTR [-24+rsp], xmm2    # copy x to tmp
    movaps    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    movaps    XMMWORD PTR [-56+rsp], xmm0    # copy y to x
    movaps    xmm0, XMMWORD PTR [-24+rsp]    # reload tmp
    movaps    XMMWORD PTR [-40+rsp], xmm0    # copy tmp to y
    movdqa    xmm0, XMMWORD PTR [-40+rsp]    # reload y
    ret                                      # return y

This is pretty much the absolute maximum amount of spilling/reloading you could imagine to swap two registers, because icc13 doesn't optimize between the inlined memcpys at all, or even remember what is left in a register.


Swapping values already in memory

Even gcc makes worse code with the memcpy version. It does the copy with 64bit integer loads/stores instead of a 128bit vector load/store. This is terrible if you're about to load the vector (store-forwarding stall), and otherwise is just bad (more uops to do the same work).

// the memcpy version of this compiles badly
void test_mem(__m128i *x, __m128i *y) {
    vecswap(*x, *y);
}
    # gcc 5.3 and ICC13 make the same code here, since it's easy to optimize
    movdqa  xmm0, XMMWORD PTR [rdi]
    movdqa  xmm1, XMMWORD PTR [rsi]
    movaps  XMMWORD PTR [rdi], xmm1
    movaps  XMMWORD PTR [rsi], xmm0
    ret

// gcc 5.3 with memswap instead of vecswap.  ICC13 is similar
test_mem_memcpy(long long __vector(2)*, long long __vector(2)*):
    mov     rax, QWORD PTR [rdi]
    mov     rdx, QWORD PTR [rdi+8]
    mov     r9, QWORD PTR [rsi]
    mov     r10, QWORD PTR [rsi+8]
    mov     QWORD PTR [rdi], r9
    mov     QWORD PTR [rdi+8], r10
    mov     QWORD PTR [rsi], rax
    mov     QWORD PTR [rsi+8], rdx
    ret
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Thanks Peter, That's the kind of insight I was looking for. The last open question: should I provide a [`std::swap` specialization for `__m128i`](http://stackoverflow.com/q/11562), or simply leave it as a stand alone function? – jww Jul 17 '16 at 23:16
  • 1
    @jww: It's probably best to just call it `vecswap`, so you can use it on derived types. (e.g. [Agner Fog's Vector Class Library wrappers like `Vec4i`](http://www.agner.org/optimize/#vectorclass)). If you did specialize, you'd probably want to do it for `__m128i`, `__m128d`, `__m128`, `__m256*`, `__m512*`, and any other SIMD types for other architectures. (ARM NEON or w/e), and then it's pretty bulky. I'd be happier just using a different function which I'm sure is always light-weight and optimizes well. If you're sure you only need a few types, worth considering a specialisation though. – Peter Cordes Jul 17 '16 at 23:31
  • 1
    Believe it or not, this resulted in the original compile problem under SunCC: `T t = a; a = b; b = t;`. I had to `T t; t=a, a=b, b=t;` to get the compiler to stop trying to initialize it. – jww Jul 18 '16 at 03:31
  • @jww: Wow, that compiler is going to choke on a lot of code, then. I assume it chokes on `__m128i v = _mm_add_epi32(a,b);` as well? For most code, I think I'd just consider it broken rather than rewrite any significant amount of code for its benefit. – Peter Cordes Jul 18 '16 at 03:56
  • @PeterCordes How is `vecswap()` different from `std::swap()`? – Leon Jul 18 '16 at 06:34
  • @Leon: IDK, I don't have SunCC. Maybe it's identical, since jww had to modify my code to avoid the error, see 2 comments ago. I updated the code in my answer, which I should have done 3 hours ago. – Peter Cordes Jul 18 '16 at 06:38