
I'm using GNU C Vector Extensions, not Intel's _mm_* intrinsics.

I want to do the same thing as Intel's _mm256_loadu_pd intrinsic. Assigning the values one by one is slow: gcc produces code with 4 separate load instructions rather than a single vmovupd (which _mm256_loadu_pd does generate).

typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    vector v;

    /* I currently do this */
    v[0] = a[0];
    v[1] = a[1];
    v[2] = a[2];
    v[3] = a[3];
}

I want something like this:

v = (vector)(a);

or

v = *((vector*)(a));

but neither works. The first fails with "can't convert value to a vector"; the second segfaults.

Peter Cordes
Jerry Zhao
  • Choose one language. C and C++ are different languages. – Shravan40 Aug 24 '16 at 04:02
  • @Shravan40: The original question was tagged C and is asking about C, the C++ edit was by someone else (now reverted). – dave Aug 24 '16 at 04:05
  • @dvhh: the code is fairly clearly not using C++ `vector` template notation and the title says 'C', not 'C++'. I don't see why C++ is appropriate. I assume that at one point it had both tags, but that seems to have been fixed. – Jonathan Leffler Aug 24 '16 at 04:05
  • If you want `memcpy`, just use `memcpy`. – David Schwartz Aug 24 '16 at 04:06
  • @DavidSchwartz: does memcpy produce the efficient code wanted on your machine? – dave Aug 24 '16 at 04:27
  • @dave If `memcpy`, given all the information, didn't produce the most efficient code on my machine, I'd use a better compiler/library/whatever. What possible excuse is there for not doing it the best way? – David Schwartz Aug 24 '16 at 04:28
  • Unless your HW architecture supports 256-bit load/store operations, I don't see any reason to hope that the assignment operation of `v = a` (even if you somehow find the correct syntax for it) will be compiled into a single opcode. If it doesn't support 128-bit load/store operations either, then you may as well copy these 4 values one by one (as you are currently doing). – barak manos Aug 24 '16 at 05:09
  • why don't just use `_m256_loadu_pd`? – phuclv Aug 24 '16 at 06:06
  • @2501: IIRC, GNU C Vector Extension pointers are defined as `may_alias`. At least the Intel intrinsic types (like `__m256d`) are defined that way in GNU C, so it's safe to mix vector loads/stores to an array of `double` with scalar loads/stores to the same array, and casting `double*` to `__m256d*`. But anyway, that doesn't explain the segfault. Failure to detect aliasing leads to loading bogus data, not segfaults (at least not directly. Sure you could construct a case where something uses a stale or uninitialized pointer, but this isn't like that.) – Peter Cordes Aug 24 '16 at 06:09
  • @barakmanos: he *is* compiling for a target that supports 256b load/store operations; specifically x86 with AVX. Even baseline x86-64 includes 128b SSE2. (Most architectures these days have 128b SIMD vectors. ARM, MIPS, PowerPC, and x86 all have more or less widely supported vector extensions. 256b vectors may be unique to x86, though.) – Peter Cordes Aug 24 '16 at 06:17
  • In this particular example, you could write directly `vector v = {1.0, 2.0, 3.0, 4.0};`, although I expect this was just to demonstrate the problem. – Marc Glisse Aug 24 '16 at 18:45
  • @PeterCordes may_alias is not necessary to let the types vector of double and double alias (just like array of double can alias double). `__m256d` is may_alias because the Intel specs want all the vector types of the same size to be compatible. – Marc Glisse Aug 24 '16 at 18:48

3 Answers

4

update: I see you're using GNU C's native vector syntax, not Intel intrinsics. Are you avoiding Intel intrinsics for portability to non-x86? gcc currently does a bad job compiling code that uses GNU C vectors wider than the target machine supports. (You'd hope that it would just use two 128b vectors and operate on each separately, but apparently it's worse than that.)

Anyway, this answer shows how you can use Intel x86 intrinsics to load data into GNU C vector-syntax types.


First of all, looking at compiler output at less than -O2 is a waste of time if you're trying to learn anything about what will compile to good code. Your main() will optimize to just a ret at -O2.

Besides that, it's not totally surprising that you get bad asm from assigning elements of a vector one at a time.


Aside: normal people would call the type v4df (vector of 4 Double Float) or something, not vector, so they don't go insane when using it with C++ std::vector. For single-precision, v8sf. IIRC, gcc uses type names like this internally for __m256d.

On x86, Intel intrinsic types (like __m256d) are implemented on top of GNU C vector syntax (which is why you can do v1 * v2 in GNU C instead of writing _mm256_mul_pd(v1, v2)). You can convert freely from __m256d to v4df, like I've done here.

I've wrapped both sane ways to do this in functions, so we can look at their asm. Notice how we're not loading from an array that we define inside the same function, so the compiler won't optimize it away.

I put them on the Godbolt compiler explorer so you can look at the asm with various compile options and compiler versions.

typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

#include <immintrin.h>

// note the return types.  gcc6.1 compiles with no warnings, even at -Wall -Wextra
v4df load_4_doubles_intel(const double *p) { return _mm256_loadu_pd(p); }
    vmovupd ymm0, YMMWORD PTR [rdi]   # tmp89,* p
    ret

v4df avx_constant() { return _mm256_setr_pd( 1.0, 2.0, 3.0, 4.0 ); }
    vmovapd ymm0, YMMWORD PTR .LC0[rip]
    ret

If the args to _mm_set* intrinsics aren't compile-time constants, the compiler will do the best it can to make efficient code to get all the elements into a single vector. It's usually best to do that rather than writing C that stores to a tmp array and loads from it, because that's not always the best strategy. (Store-forwarding failure on multiple narrow stores forwarding to a wide load costs an extra ~10 cycles (IIRC) of latency on top of the usual store-forwarding delay. If your doubles are already in registers, it's usually best to just shuffle them together.)


See also Is it possible to cast floats directly to __m128 if they are 16 byte alligned? for a list of the various intrinsics for getting a single scalar into a vector. The tag wiki has links to Intel's manuals, and their intrinsics finder.


Load/store GNU C vectors without Intel intrinsics:

I'm not sure how you're "supposed" to do that. This Q&A suggests casting a pointer to the memory you want to load, and using a vector type like typedef char __attribute__ ((vector_size (16),aligned (1))) unaligned_byte16; (note the aligned(1) attribute).

You get a segfault from *(v4df *)a because presumably a isn't aligned on a 32-byte boundary, but you're using a vector type that does assume natural alignment. (Just like __m256d if you dereference a pointer to it instead of using load/store intrinsics to communicate alignment info to the compiler.)

Peter Cordes
  • Using attribute aligned seems like the normal way to load/store vectors in unaligned memory. – Marc Glisse Aug 24 '16 at 18:43
  • And that's even how gcc implements _mm256_loadu_pd now ;-) https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/i386/avxintrin.h?revision=239889&view=markup#l868 – Marc Glisse Aug 31 '16 at 12:08
1

You can use the equivalent intrinsics from gcc for x86: __builtin_ia32_loadupd256 (https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html#x86-Built-in-Functions).

So something like:

typedef double v4df __attribute__((vector_size(4 * sizeof(double))));

void vector_copy(double *a, v4df *v)
{
    *v = __builtin_ia32_loadupd256(a);
}
dave
  • That's not portable to non-x86 platforms. I just realized that's what the OP is doing. – Peter Cordes Aug 24 '16 at 05:36
  • @PeterCordes: You are right, I've added a note that this is for x86 (which is what I thought OP wanted). Your answer is still better though. – dave Aug 24 '16 at 06:49
  • Let me know if you come across a platform-independent way to write vector loads/stores with GNU C vector extensions. I find it super-weird that there isn't a portable aligned / unaligned load/store builtin, only the platform-specific builtins. Maybe you are supposed to pointer-cast and dereference? Anyway, upvoted your answer for pointing out the GNU C builtin that `_mm256_loadu_pd` is implemented on top of. – Peter Cordes Aug 24 '16 at 07:00
  • Please don't use those builtins. The only advantage over intrinsics is that you don't need a #include. But it may be removed from gcc in the future. – Marc Glisse Aug 24 '16 at 18:38
-2

If you don't need a copy of `a`, use a pointer instead (see `v_ptr` in the example). If you do need a copy, use `memmove` (see `v_copy`).

#include <stdio.h>
#include <string.h>

typedef double vector __attribute__((vector_size(4 * sizeof(double))));

int main(int argc, char **argv) {
  double a[4] = {1.0, 2.0, 3.0, 4.0};
  vector *v_ptr;
  vector v_copy;

  v_ptr = (vector*)&a;
  memmove(&v_copy, a, sizeof(a));

  printf("a[0] = %f // v[0] = %f // v_copy[0] = %f\n", a[0], (*v_ptr)[0], v_copy[0]);
  printf("a[2] = %f // v[2] = %f // v_copy[2] = %f\n", a[2], (*v_ptr)[2], v_copy[2]);
  return 0;
}

output:

a[0] = 1.000000 // v[0] = 1.000000 // v_copy[0] = 1.000000
a[2] = 3.000000 // v[2] = 3.000000 // v_copy[2] = 3.000000
  • can you show the output of `gcc -S` which shows whether the compiler did separate copies or used the SSE instructions? – dave Aug 24 '16 at 04:31
  • My assembly is rusty and I am on a Mac using clang, but you should be able to build the example above to verify. You can split these into two files, one with v_ptr and one with v_copy, and check. On my machine, when I compile with -O3, both of the examples compile to the same .s file, but that is likely because the optimizer realizes that I am not modifying v_copy, so it optimizes the memmove away. – Miguel Sosa Aug 24 '16 at 04:51