What does it mean for an SSE vector to be "16 byte aligned" and how can I ensure that it is?

Question

I'm working with vectors and matrices right now, and it was suggested to me that I should use SSE instead of float arrays. However, while reading the definitions of the C intrinsics and the assembly instructions, it looks like some of the functions come in two versions: one where the vector has to be "16 byte aligned" and a slower one where the vector doesn't have to be aligned. What does it mean for a vector to be 16 byte aligned? How can I ensure that my vectors are 16 byte aligned?

Pretty sure it means your structs will be padded so that the size is always a multiple of 16 bytes...Then you get much better 'bus' transport on the motherboard etc, as a discrete set of vectors will sit on the bus and travel together without being broken up and re-assembled....Does that make sense? — Grantly, Dec 15 '17 at 21:22
@Grantly so If I'm just using a variable of type `__m128` or using a union where the biggest member is 16 bytes I can use those functions? — zee, Dec 15 '17 at 21:24
Sorry, I have to vote down since an incorrect answer was marked accepted, and this may mislead future readers. — Eric Postpischil, Dec 15 '17 at 22:06
I edited and improved my answer yet again...I was torn whether to delete it or not - giving respect to those who answered the question more accurately. But I did alot of reading and hope the answer is now on topic and helpful. If it pleases - I will delete it — Grantly, Dec 16 '17 at 19:30

zneak · Accepted Answer · 2017-12-18T05:41:43.970

Alignment ensures that objects are aligned on an address that is a multiple of some power of two. 16-byte-aligned means that the numeric value of the address is a multiple of 16. Alignment is important because CPUs are often less efficient or downright incapable of loading memory that doesn't have the required alignment.
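As a minimal sketch of that definition, the check below tests whether an address is a multiple of a given power of two. The helper name is_aligned is made up for illustration; it is not a library function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the numeric value of ptr is a multiple of alignment.
   alignment must be a power of two. is_aligned is a hypothetical helper. */
static bool is_aligned(const void *ptr, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, "multiple of alignment" means the low bits are zero. */
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

For example, is_aligned(p, 16) tells you whether p is safe to hand to an aligned SSE load.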

Your ABI determines the natural alignment of types. In general, integer types and floating-point types are aligned to either their own size, or the size of the largest object of that kind that your CPU can treat at once, whichever is smaller. For instance, on 64-bit Intel machines, 32-bit integers are aligned on 4 bytes and 64-bit integers are aligned on 8 bytes. The rule is only a rule of thumb, though: under the System V x86-64 ABI, the 128-bit __int128 type is aligned on 16 bytes.

The alignment of structures and unions is the same as their most aligned field. This means that if your struct contains a field that has a 2-byte alignment and another field that has an 8-byte alignment, the structure will be aligned to 8 bytes.

In C++, you can use the alignof operator, just like the sizeof operator, to get the alignment of a type. In C, the same construct becomes available when you include <stdalign.h>; alternatively, you can use _Alignof without including anything.
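A small sketch of alignof in C11, checking at compile time that a struct takes the alignment of its most aligned field. The struct name mixed and the helper mixed_alignment are made up for illustration, and the 4-byte figure for int32_t assumes a typical ABI.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignof (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct with a weakly and a strictly aligned field. */
struct mixed {
    int16_t small; /* 2-byte alignment */
    int64_t large; /* 8-byte alignment on x86-64 */
};

/* The struct is as aligned as its most aligned field. */
static_assert(alignof(struct mixed) == alignof(int64_t),
              "struct takes the largest member alignment");
static_assert(alignof(int32_t) == 4, "assumes a typical ABI");

/* alignof yields a size_t, just like sizeof. */
static size_t mixed_alignment(void) {
    return alignof(struct mixed);
}
```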

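Since C11 and C++11, there is a standard way to force alignment to a specific value: the alignas specifier (spelled _Alignas in C, or alignas once you include <stdalign.h>). Before that, you had to rely on compiler-specific extensions, which are still widely used. On Clang and GCC, you can use the __attribute__((aligned(N))) attribute: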

struct s_Stuff {
   int var1;
   short  var2;
   char padding[10];
} __attribute__((aligned(16)));

(Example.)

(The same aligned attribute can also be applied to an individual variable to set that variable's alignment.)

For Visual Studio, according to SoronelHaetir, that would be __declspec(align(N)). It goes between the struct keyword and the struct's name, as in struct __declspec(align(16)) s_Stuff { ... };.
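For comparison, here is a sketch of the same struct using the standard C11 alignas specifier instead of a compiler extension. The name s_StuffAligned is made up for illustration. In C, alignas is written on a member (or a variable) rather than on the struct as a whole; aligning one member to 16 bytes makes the whole struct 16-byte aligned.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas, alignof (C11) */

/* Same members as s_Stuff, aligned with the standard specifier.
   s_StuffAligned is a hypothetical name. */
struct s_StuffAligned {
    alignas(16) int var1; /* forces the struct's alignment up to 16 */
    short var2;
    char padding[10];
};

static_assert(alignof(struct s_StuffAligned) == 16,
              "struct is 16-byte aligned");
static_assert(sizeof(struct s_StuffAligned) % 16 == 0,
              "size is always a multiple of the alignment");
```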

In the context of vector instructions, alignment is important because people tend to create arrays of floating-point values and operate on them, instead of using types that are known to be aligned. A plain float array is only guaranteed to have the alignment of float. However, __m128, __m256 and __m512 (and all of their variants, like __m128i and such) from <immintrin.h>, if your compiler environment has it, are guaranteed to be aligned on the proper boundaries for use with aligned intrinsics.
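To make the aligned/unaligned distinction concrete, here is a sketch of the two flavors of load and store for four floats. It assumes an x86 target with SSE; the function names add4_aligned and add4_unaligned are made up for illustration, while the intrinsics themselves are the standard SSE ones.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdalign.h>

/* Adds four floats. All three pointers must be 16-byte aligned, because
   _mm_load_ps and _mm_store_ps can fault on misaligned addresses. */
static void add4_aligned(const float *a, const float *b, float *out) {
    __m128 va = _mm_load_ps(a);
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(out, _mm_add_ps(va, vb));
}

/* Same operation with no alignment requirement on the pointers. */
static void add4_unaligned(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

A caller would declare its buffers as alignas(16) float a[4]; before passing them to add4_aligned, whereas add4_unaligned accepts any float pointer.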

Depending on your platform, malloc may or may not return memory that is aligned on the correct boundary for vector objects. aligned_alloc was introduced in C11 to address these issues, but not all platforms support it.

Apple: does not support aligned_alloc; malloc returns objects on the most exigent alignment that the platform supports;
Windows: does not support aligned_alloc; malloc returns objects aligned on the largest alignment that VC++ will naturally put an object on without an alignment specification; use _aligned_malloc for vector types
Linux: malloc returns objects aligned on an 8- or 16-byte boundary; use aligned_alloc.
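Where aligned_alloc is available, a sketch of its use might look like this. The helper name alloc_floats16 is made up for illustration. C11 requires the requested size to be a multiple of the alignment, so the byte count is rounded up first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for count floats on a 16-byte boundary.
   Release the result with free(). alloc_floats16 is a hypothetical helper. */
static float *alloc_floats16(size_t count) {
    assert(count > 0);
    size_t bytes = count * sizeof(float);
    /* Round up to the next multiple of 16, as C11 aligned_alloc expects. */
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```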

In general, it's possible to request slightly more memory and perform the alignment yourself with minimal penalties (aside from having to write your own free-like function that accepts a pointer returned by this function, because that pointer is not necessarily the one malloc handed out):

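#include <stdint.h>
#include <stdlib.h>

/* alignment must be a power of two; do NOT pass the result to free() */
void* aligned_malloc(size_t size, size_t alignment) {
    uintptr_t alignment_mask = alignment - 1;
    void* memory = malloc(size + alignment_mask);
    if (memory == NULL) return NULL;
    uintptr_t unaligned_ptr = (uintptr_t)memory;
    uintptr_t aligned_ptr = (unaligned_ptr + alignment_mask) & ~alignment_mask;
    return (void*)aligned_ptr;
}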

Purists might argue that treating pointers as integers is evil, but at the time of writing, they probably won't have a practical cross-platform solution to offer in exchange.
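One common way to get a matching free-like function is to over-allocate a little more and stash the pointer that malloc returned just below the aligned block. The sketch below does that; the names my_aligned_malloc and my_aligned_free are made up for illustration, and it assumes the alignment is a power of two no smaller than sizeof(void *).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns size bytes aligned to alignment, or NULL. */
static void *my_aligned_malloc(size_t size, size_t alignment) {
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
    /* Extra room for the worst-case shift plus one stored pointer. */
    void *raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL) return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw; /* remember what malloc gave us */
    return (void *)aligned;
}

/* Hypothetical counterpart: frees a block from my_aligned_malloc. */
static void my_aligned_free(void *ptr) {
    if (ptr != NULL) free(((void **)ptr)[-1]);
}
```

The cost is one extra pointer per allocation, which is usually negligible next to the vector data itself.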

Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/161465/discussion-on-answer-by-zneak-what-does-it-mean-for-an-sse-vector-to-be-16-byte). — Andy, Dec 18 '17 at 18:15

score 2 · Answer 2 · answered Dec 15 '17 at 21:28

xx-byte alignment means that the variable's memory address modulo xx is 0.

Ensuring that is compiler-specific. Visual C++, for example, has __declspec(align(...)), which works for variables that the compiler allocates (at file or function scope, for example). Alignment is somewhat harder for dynamic memory: you can use _aligned_malloc for that, although your library's malloc may already guarantee 16-byte alignment. It's generally larger alignments that require such a call.

Grantly · Answer 3 · 2017-12-17T00:43:10.980

New Edit to improve and focus my answer to the specific query

To ensure data alignment in dynamically allocated memory, some C runtimes provide specific functions to force it (assuming your data is compatible, that is, your data matches or fits evenly into your required alignment).

With the Microsoft C runtime, the function to use is _aligned_malloc instead of vanilla malloc. Memory obtained this way must be released with _aligned_free, not free.

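// Using _aligned_malloc (Microsoft C runtime, declared in <malloc.h>)
// Note: alignment must be a power of two (2^N).
size_t alignment = 16;
size_t required_size = 4 * sizeof(float);  // placeholder: however many bytes you need
void *ptr = _aligned_malloc(required_size, alignment);
if (ptr == NULL)
{
    printf_s("Error allocating aligned memory.");
    return -1;
}
// ... use ptr ...
_aligned_free(ptr);  // never pass this pointer to plain free()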

This will (if it succeeds) force your data to align on the 16 byte boundary and should satisfy the requirements for SSE.
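Because _aligned_malloc only exists in the Microsoft runtime, code that also has to build elsewhere often hides the difference behind a pair of wrappers. Here is a sketch of that idea, assuming posix_memalign is available on the non-Windows side; the names portable_aligned_alloc and portable_aligned_free are made up for illustration.

```c
#define _POSIX_C_SOURCE 200112L /* exposes posix_memalign on POSIX systems */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

/* Hypothetical wrapper: alignment must be a power of two and,
   for posix_memalign, a multiple of sizeof(void *). */
static void *portable_aligned_alloc(size_t size, size_t alignment) {
    assert(alignment >= sizeof(void *) && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0) return NULL;
    return ptr;
#endif
}

/* Hypothetical counterpart: releases memory from portable_aligned_alloc. */
static void portable_aligned_free(void *ptr) {
#if defined(_WIN32)
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```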

Older answer where I waffle on about struct member alignment, which matters - but is not directly answering the query

To ensure struct member byte alignment, you can be careful how you arrange members in your structs (largest first), or you can set this (to some degree) in your compiler settings, member attributes or struct attributes.

Assuming a 32-bit machine with 4-byte ints: this struct is still 4-byte aligned in memory (its most strictly aligned member is the 4-byte int), but padded to be 16 bytes in size.

struct s_Stuff {
   int var1;  /* 4 bytes */
   short  var2;  /* 2 bytes */
   char padding[10];  /* ensure totals struct size is 16 */
};

The compiler usually pads each member to assist with natural alignment, but the padding may be at the end of the struct too. This is struct member data alignment.

Older compiler struct member alignment settings could look similar to these 2 images below...But this is different to data alignment which relates to memory allocation and storage of the data.

It confuses me when Borland uses the phrase (from the images) Data Alignment, and MS uses Struct member alignment. (Although they both refer to specifically struct member alignment)

To maximise efficiency, you need to code for your hardware (or vector processing in this case), so let's assume a 32-bit machine, 4-byte ints, etc. Tight structs save space, but padded structs may improve speed.

struct s_Stuff {
   float f1;   /* 4 bytes */
   float f2;   /* 4 bytes */
   float f3;   /* 4 bytes */
   short  var2;  /* 2 bytes */
};

This struct may be padded so that its members align to 4-byte multiples. The compiler will do this unless you tell it to keep single-byte struct member alignment. So the size on file could be 14 bytes, but in memory each element of an array of this struct would occupy 16 bytes (with 2 bytes wasted), at an unknown data alignment (possibly 8 bytes by default from malloc, but that is not guaranteed). As mentioned above, you can force the data alignment in memory with _aligned_malloc on some platforms.
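The padding described above can be checked at compile time. This sketch assumes a 4-byte float and a 2-byte short, which is typical but not guaranteed by the C standard.

```c
#include <assert.h> /* static_assert (C11) */
#include <stddef.h> /* offsetof */

struct s_Stuff {
    float f1;   /* 4 bytes */
    float f2;   /* 4 bytes */
    float f3;   /* 4 bytes */
    short var2; /* 2 bytes */
};

/* 14 bytes of members, rounded up to the struct's 4-byte alignment. */
static_assert(offsetof(struct s_Stuff, var2) == 12, "var2 follows three floats");
static_assert(sizeof(struct s_Stuff) == 16, "2 bytes of tail padding");
/* Padding to 16 bytes in size does not make the struct 16-byte aligned. */
static_assert(_Alignof(struct s_Stuff) == 4, "aligned like float");
```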

Also regarding member alignment in a struct, the compiler uses the alignment of the most strictly aligned member as the alignment of the whole struct. Or more specifically:

A struct is always aligned to the largest type’s alignment requirements

...from here

If you are using a union, you are correct that its size is that of its largest member (rounded up to its alignment), and its alignment is that of its most strictly aligned member; see here

Check that your compiler settings do not contradict your desired struct member alignment / padding too, or else your structs may differ in size to what you expect.

Now, why is it faster? See here which explains how alignment allows the hardware to transmit discrete chunks of data and maximises the use of the hardware that passes around data. That is, the data does not need to be split up or re-arranged at every stage - through the hardware processing

As a rule, it's best to configure your compiler to match your hardware (and platform OS) so that your alignment (and padding) works best with your hardware's processing ability. 32-bit machines usually work best with 4-byte (32-bit) member alignment, but data written to file with 4-byte member alignment can consume more space than wanted.

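Specifically regarding SSE vectors, as this link states, 4 * 4 bytes is the natural size for a 16-byte vector, perhaps like this. (And they refer to data alignment here.) Note that a size of 16 bytes does not by itself guarantee 16-byte alignment: the struct below only has the alignment of float unless you add an alignment specifier or allocate it with an aligned allocator.

struct s_data {
   float array[4];
};

or simply an array of floats, or doubles, with the same caveat.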
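A sketch of the difference between size and alignment for this struct, using C11 alignas. The name s_data16 and the helper all_aligned16 are made up for illustration, and the size check assumes a 4-byte float.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas, alignof (C11) */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct s_data {
    float array[4];             /* 16 bytes in size, aligned like float */
};

/* Hypothetical variant with an explicit alignment request. */
struct s_data16 {
    alignas(16) float array[4]; /* 16 bytes in size and 16-byte aligned */
};

static_assert(sizeof(struct s_data) == 16, "assumes 4-byte float");
static_assert(alignof(struct s_data) == alignof(float), "no stricter than float");
static_assert(alignof(struct s_data16) == 16, "explicitly 16-byte aligned");

/* Checks that every element of an array starts on a 16-byte boundary. */
static bool all_aligned16(const struct s_data16 *items, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if ((uintptr_t)&items[i] % 16 != 0) return false;
    }
    return true;
}
```

Because sizeof(struct s_data16) is a multiple of its alignment, every element of an array of s_data16 is suitably aligned for the aligned SSE loads.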

Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/161466/discussion-on-answer-by-grantly-what-does-it-mean-for-an-sse-vector-to-be-16-by). — Andy, Dec 18 '17 at 18:15
