
GCC's vector extensions offer a nice, reasonably portable way of accessing some SIMD instructions on different hardware architectures without resorting to hardware-specific intrinsics (or auto-vectorization).

A real use case is calculating a simple additive checksum. The one thing that isn't clear is how to safely load data into a vector.

typedef char v16qi __attribute__ ((vector_size(16)));

static uint8_t checksum(uint8_t *buf, size_t size)
{
    assert(size%16 == 0);
    uint8_t sum = 0;

    v16qi vec = {0};
    for (size_t i=0; i<(size/16); i++)
    {
        // XXX: Yuck! Is there a better way?
        vec += *((v16qi*) (buf + i*16));
    }

    // Sum up the vector
    sum = vec[0] + vec[1] + vec[2] + vec[3] + vec[4] + vec[5] + vec[6] + vec[7] + vec[8] + vec[9] + vec[10] + vec[11] + vec[12] + vec[13] + vec[14] + vec[15];

    return sum;
}

Casting a pointer to the vector type appears to work, but I'm worried this might explode in a horrible fashion if SIMD hardware expects the vector types to be correctly aligned.

The only other option I've thought of is to use a temporary vector and explicitly load the values (via either memcpy or element-wise assignment), but in testing this counteracts most of the speedup gained from using SIMD instructions. Ideally I'd imagine there would be something like a generic __builtin_load() function, but none seems to exist.

What's a safer way of loading data into a vector without risking alignment issues?

Jonathan Leffler
dcoles
  • 2
    Running this on unaligned memory on GCC x86_64 will cause a SIGSEGV when the CPU attempts to load the unaligned memory into a SSE register. One reasonable option seems to be either only checksum aligned memory or use a normal loop to sum the bytes up until the first 16 byte boundary. – dcoles Feb 17 '12 at 00:18
  • In your current code, loading the data actually compiles nicely if the compiler knows about the input (but the sum is bad): https://godbolt.org/g/DeR3Qv. It's not so nice without knowledge of the input: https://godbolt.org/g/LxEkhp – ZachB Sep 21 '16 at 18:14

2 Answers


Edit (thanks Peter Cordes) You can cast pointers:

typedef char v16qi __attribute__ ((vector_size (16), aligned (16)));

v16qi vec = *(v16qi*)&buf[i]; // load
*(v16qi*)(buf + i) = vec; // store whole vector

This compiles to vmovdqa to load and vmovups to store. If the data isn't known to be aligned, set aligned (1) to generate vmovdqu. (godbolt)
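As a rough sketch of how the `aligned (1)` variant could be applied to the checksum from the question: the names `u8x16`, `u8x16_u` and `checksum_unaligned` below are made up for illustration, and the `may_alias` attribute is an extra assumption added so that the vector pointer may read a buffer of any element type. The buffer pointer does not need any particular alignment here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names. u8x16 is an ordinary 16-byte vector of uint8_t. */
typedef uint8_t u8x16 __attribute__((vector_size(16)));

/* aligned(1): the pointee may sit at any address, so the compiler has to
   pick an unaligned load. may_alias: reads through this type are allowed
   to alias objects of other types. */
typedef uint8_t u8x16_u __attribute__((vector_size(16), aligned(1), may_alias));

static uint8_t checksum_unaligned(const uint8_t *buf, size_t size)
{
    assert(size % 16 == 0);

    u8x16 acc = {0};
    for (size_t i = 0; i < size; i += 16) {
        /* Explicit cast between the two same-sized vector types. */
        acc += (u8x16)*(const u8x16_u *)(buf + i);
    }

    /* Horizontal sum of the 16 lanes; uint8_t arithmetic wraps mod 256. */
    uint8_t sum = 0;
    for (int k = 0; k < 16; k++)
        sum = (uint8_t)(sum + acc[k]);
    return sum;
}
```

Calling it as `checksum_unaligned(buf + 1, 32)` is the case that would fault with an aligned load; it is worth checking the generated assembly to confirm an unaligned move is emitted on your target.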

Note that there are also several x86-specific intrinsics for loading and storing these registers (Edit 2):

v16qi vec = _mm_loadu_si128((__m128i*)&buf[i]); // _mm_load_si128 for aligned
_mm_storeu_si128((__m128i*)&buf[i], vec); // _mm_store_si128 for aligned

It seems to be necessary to use -flax-vector-conversions to go from chars to v16qi with this function.
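One way that might avoid the flag is an explicit cast between the two same-sized vector types, since GNU C allows such casts. The sketch below wraps the unaligned intrinsics in two helpers; `load16` and `store16` are hypothetical names, and the `memcpy` fallback for non-SSE2 targets is an assumption added only so the example builds elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef char v16qi __attribute__((vector_size(16)));

/* Hypothetical helper: load 16 bytes from any address into a v16qi. */
static v16qi load16(const void *p)
{
#if defined(__SSE2__)
    /* Explicit __m128i -> v16qi cast, so no implicit conversion is needed. */
    return (v16qi)_mm_loadu_si128((const __m128i *)p);
#else
    /* Assumed fallback for targets without SSE2. */
    v16qi v;
    memcpy(&v, p, sizeof v);
    return v;
#endif
}

/* Hypothetical helper: store a v16qi to any address. */
static void store16(void *p, v16qi v)
{
#if defined(__SSE2__)
    _mm_storeu_si128((__m128i *)p, (__m128i)v);
#else
    memcpy(p, &v, sizeof v);
#endif
}
```

With these, `store16(dst + 1, load16(src + 1))` copies 16 bytes between two deliberately misaligned addresses.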

See also: C - How to access elements of vector using GCC SSE vector extension
See also: SSE loading ints into __m128

(Tip: The best phrase to google is something like "gcc loading __m128i".)

Community
ZachB
  • 1
    Apparently the recommended way to load unaligned data into GNU C vectors is with an `aligned(1)` attribute when declaring a vector type, and cast pointers to that unaligned-vector type. e.g. `typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16;`. See [the end of my answer here](http://stackoverflow.com/a/39115055/224132), and Marc Glisse's comments on it. – Peter Cordes Sep 21 '16 at 07:03
  • To extract, I think you should be using `vec[0]`. As I understand it, aliasing scalar pointers onto vector types is *not* ok. It works with `char*` because `char*` is special, and allowed to alias anything. Casting an `int*` to a `v4si*` doesn't even count as aliasing, because v4si is defined in terms of `int`. The Intel intrinsics types (`__m128i`) can alias onto other things too, because of an extra attribute: `typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));` Without may_alias, you can't safely `v4si ivec = *(v4si*)short_pointer`. I left that out before – Peter Cordes Sep 21 '16 at 19:20
  • See `/usr/lib/gcc/x86_64-linux-gnu/5/include/emmintrin.h` or wherever your copy of gcc keeps that header. – Peter Cordes Sep 21 '16 at 19:21
  • Re: extract, just realized that what I had was moving a single byte, changed back to memcpy for the moment, still digging... thanks for the tips – ZachB Sep 21 '16 at 19:21
  • 2
    This issue of how to correctly get data into/out of GNU C vector extensions really seems to need a tutorial, or longer canonical answer. I might be able to write one, but I haven't used them in any code I've written, other than experiments on Godbolt. – Peter Cordes Sep 21 '16 at 19:23

You could use an initializer to load the values, i.e. do

const v16qi e = { buf[0], buf[1], ... , buf[15] };

and hope that GCC turns this into an SSE load instruction. I'd verify that with a disassembler, though ;-). Also, for better performance, try to make buf 16-byte aligned, and inform the compiler via an aligned attribute. If you can't guarantee that the input buffer will be aligned, process it bytewise until you've reached a 16-byte boundary.
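The "bytewise until a 16-byte boundary" idea could look roughly like the sketch below. It swaps the initializer for a pointer cast to a vector type once the address is known to be aligned; `u8x16_al` and `checksum_peeled` are hypothetical names, and `may_alias` is an added assumption so the cast may read the byte buffer. A scalar tail loop is also added, so the size no longer has to be a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: a naturally (16-byte) aligned vector of uint8_t. */
typedef uint8_t u8x16_al __attribute__((vector_size(16), may_alias));

static uint8_t checksum_peeled(const uint8_t *buf, size_t size)
{
    uint8_t sum = 0;
    size_t i = 0;

    /* Scalar head: consume bytes until buf + i is 16-byte aligned. */
    while (i < size && ((uintptr_t)(buf + i) % 16) != 0)
        sum = (uint8_t)(sum + buf[i++]);

    /* Vector body: every load here is from a 16-byte aligned address. */
    u8x16_al acc = {0};
    for (; i + 16 <= size; i += 16)
        acc += *(const u8x16_al *)(buf + i);

    /* Fold the 16 lanes into the running sum (wraps mod 256). */
    for (int k = 0; k < 16; k++)
        sum = (uint8_t)(sum + acc[k]);

    /* Scalar tail: fewer than 16 bytes remain. */
    for (; i < size; i++)
        sum = (uint8_t)(sum + buf[i]);

    return sum;
}
```

Because the head and tail are scalar, the function should return the same value for any starting address and any length; whether it beats a plain unaligned load is something to measure on the target CPU.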

fgp
  • I don't think aligning buf is necessary. It would be, if we were dealing with pointers. – user1095108 Oct 15 '13 at 22:06
  • @user1095108 You want the compiler to turn this into an SSE load instruction, which is the equivalent of `e = *buf`(but you can't write it that way because the types don't match). So you ARE dealing with pointers here, actually. If the compiler can infer that buf is 16-byte aligned it can thus use an aligned load, which (pre ivy-bridge, at least) is faster than an unaligned load. – fgp Oct 16 '13 at 13:17
  • No, you'd be dealing with pointers if you were to cast `buf` to `vec16qi` from my experience. – user1095108 Oct 16 '13 at 14:16
  • @user1095108 You misunderstood me, I think. You of course aren't dealing with pointers, *strictly speaking*. But you're loading values (16 of them, actually) *pointed to* by `buf`, which is *exactly* what dereferencing a pointer (of type vec16qi) would do. Now, since we're not *strictly speaking* dereferencing `buf`, the pointer doesn't *have* to be aligned for correctness. *But* it might still make a huge difference in performance - and it indeed does on some CPUs. That's assuming the compiler even turns this into an SSE load instruction. – fgp Oct 16 '13 at 19:48
  • On my machine, I only see alignment issues when dealing with pointers directly, not when dereferencing them while loading them into a vector. – user1095108 Oct 16 '13 at 19:52
  • GCC7 doesn't seem to do anything smart with the syntax in this answer: https://godbolt.org/g/epocpU. Each byte is moved with `movzbl`. Not sure if some alignment hints would do the trick but doesn't look likely. – ZachB Sep 21 '16 at 02:40