
Imagine that, after some SIMD calculations, I end up with a __m128i value whose fourth field holds a useless zero. Is there a simple and portable way to cast the other three fields into a std::tuple<int,int,int>, bearing in mind that std::tuple is not guaranteed to be standard layout?

metalfox
  • "Simple" often conflicts with "portable" and/or "standards-conforming." There's no requirement that a `std::tuple` have standard layout. However, you'll likely find that most implementations yield a memory layout that is as you would expect. If that's the case, the implementation that you already have in your head would *probably* work, if strict standards compliance isn't a requirement. Since you referenced `__m128i`, you're on x86, so I'm not considering any strange padding/alignment requirements that `int` might have; a `__m128i` is laid out just like an `int[4]`. – Jason R May 12 '17 at 12:28
  • @JasonR: Order of fields of `tuple` is unspecified... so the *"probably work"* is too optimistic IMO. – Jarod42 May 12 '17 at 12:29
  • @Jason R Actually, for SIMD on the x86 platform, `__m128` *requires* `alignment(16)`. – Swift - Friday Pie May 12 '17 at 12:31
  • @Jarod42: Perhaps. The lesson to the OP is probably that no, there's not a portable or standards-compliant way to do this. With that said, there's lots of C++ code out there that isn't perfectly standards-compliant. If you can bound your set of platforms and compiler/library versions, and you're willing to take on the potential maintenance headache in the future, it may be possible. – Jason R May 12 '17 at 12:32
  • @Swift: Not really. A `__m128` is really just an abstraction on top of a 16-byte SSE register. These can be loaded/stored from/to memory with aligned and unaligned memory instructions, so if you have an `__m128` type, it is possible to store it at an unaligned location. With that said, compilers often treat them as having `alignment(16)` when automatically generating code that manipulates them (e.g. when registers spill to the stack). It's a common misconception that you need aligned data to use SIMD on x86; 128-bit unaligned loads have been (mostly) penalty-free for many CPU generations now. – Jason R May 12 '17 at 12:34
  • @Jason R If you use `_mm_loadu_ps` directly, yes. Whether there is a penalty really depends on cache setup, platform, number of subsequent operations, etc. If `_mm_load_ps` is used with unaligned memory, it crashes. I work on a project that runs on quite a number of varied platforms at the same time (Intel 3000 generation and below), and it benefits from aligned loads due to the sheer amount of data processed (think gigabytes per second). – Swift - Friday Pie May 12 '17 at 12:42
  • @Swift: I work on similar projects. I would encourage you to benchmark changing `_mm_load_ps()` to `_mm_loadu_ps()`. You'll find that for like conditions, their performance is essentially indistinguishable; the choice of instruction you use doesn't really matter. With that said, aligned loads *can* be faster, since they're guaranteed to not straddle cache line or page boundaries, but it doesn't matter which type of instruction you use. Assuming unaligned 128-bit memory operations can simplify your code structure and loosen its constraints on its input and output. – Jason R May 12 '17 at 12:45
  • @Jason R I'll still have a constraint on the output, because it is consumed by hardware that requires alignment. I think the problem might be in the compiler (PGI C++), which uses something slower for the unaligned intrinsic. Using loadu on aligned data results in sharp drops in performance compared to autovectorized code. – Swift - Friday Pie May 12 '17 at 12:52
  • @Jason R Actually, I looked it up: all my platforms are either Atom-like or pre-Nehalem. What you describe starts to hold true with Nehalem and above. – Swift - Friday Pie May 12 '17 at 13:00
  • If your SIMD vector is like a 3-tuple, you're probably using an inefficient AoS-based layout with an inefficient lane-wasting calculation. – harold May 12 '17 at 16:57
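
For readers following the aligned-versus-unaligned discussion in the comments above, here is a minimal sketch of the two load styles. The helper names `add_pairs_unaligned` and `add_pairs_aligned` are hypothetical and exist only for illustration; the intrinsics themselves are the standard SSE ones from `<xmmintrin.h>`. It makes no claim about relative performance, which the commenters agree depends on the CPU generation and compiler.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_add_ps, stores

// Hypothetical helper: adds the four floats at p to the four floats at p + 4.
// p and out may have any alignment, because only unaligned intrinsics are used.
inline void add_pairs_unaligned(const float* p, float* out)
{
    __m128 a = _mm_loadu_ps(p);       // no alignment requirement
    __m128 b = _mm_loadu_ps(p + 4);
    _mm_storeu_ps(out, _mm_add_ps(a, b));
}

// Same computation, but p and out must be 16-byte aligned;
// passing an unaligned pointer here typically faults at run time.
inline void add_pairs_aligned(const float* p, float* out)
{
    __m128 a = _mm_load_ps(p);
    __m128 b = _mm_load_ps(p + 4);
    _mm_store_ps(out, _mm_add_ps(a, b));
}
```

The aligned variant needs buffers declared with something like `alignas(16)`, whereas the unaligned variant also accepts pointers such as `buf + 1`. Benchmarking both on the target hardware, as suggested above, is the only reliable way to tell whether the difference matters.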

1 Answer


Ugly, but portable. I don't believe there is a fast solution, since std::tuple does not have a defined memory layout, so just copy those three values into a tuple.

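#include <emmintrin.h>  // SSE2: __m128i
#include <tuple>

// Copies the three low 32-bit lanes; the fourth lane is ignored.
std::tuple<int, int, int> to_tuple(const __m128i& value)
{
    auto* ptr = reinterpret_cast<const int*>(&value);
    return std::make_tuple(ptr[0], ptr[1], ptr[2]);
}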

Why do you need this? Maybe you can get around your problem some other way.

Zereges
  • This is what I was doing, albeit with `_mm_storeu_si128` instead of `reinterpret_cast`, but I think the resulting assembly is the same. I hoped there was a more elegant solution... – metalfox May 15 '17 at 07:39
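
The `_mm_storeu_si128` variant mentioned in the comment above might look roughly like the sketch below. The function name `to_tuple_store` is hypothetical, and this is one plausible reading of that approach rather than the commenter's actual code. It spills the register to a plain `int` array first, so the lanes are read as ordinary `int` objects instead of through a reinterpreted pointer to the vector.

```cpp
#include <emmintrin.h>  // SSE2: __m128i, _mm_storeu_si128, _mm_set_epi32
#include <tuple>

// Hypothetical helper: stores all four lanes to a local array,
// then copies the three low lanes into the tuple.
inline std::tuple<int, int, int> to_tuple_store(__m128i value)
{
    int lanes[4];  // no special alignment needed for the unaligned store
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), value);
    return std::make_tuple(lanes[0], lanes[1], lanes[2]);
}
```

For example, `to_tuple_store(_mm_set_epi32(0, 30, 20, 10))` yields the tuple `(10, 20, 30)`, because `_mm_set_epi32` takes its arguments from the highest lane down to the lowest.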