
I am generating SSE/AVX instructions and currently I have to use unaligned loads and stores. I operate on a float/double array and will never know whether it is aligned or not. So before vectorizing it, I would like to have a pre-loop and possibly a post-loop that take care of the unaligned parts, while the main vectorized loop operates on the aligned part.

But how do I determine when an array is aligned? Can I check the pointer value? When should the pre-loop stop and the post-loop start?

Here is my simple code example:

void func(double * in, double * out, unsigned int size){
    for( as long as in unaligned part ){
        out[i] = do_something_with_array(in[i])
    }
    for( as long as aligned ){
        awesome avx code that loads operates and stores 4 doubles
    }
    for( remaining part of array ){
        out[i] = do_something_with_array(in[i])
    }
 }

Edit: I have been thinking about it. Theoretically the address of the i-th element should be evenly divisible by the vector width in bytes, something like `&a[i] % 16 == 0` for SSE or `% 32` for AVX (how many elements that takes depends on whether they are floats or doubles). So the first loop should cover the elements whose addresses are not divisible like that.
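
Something like this is the check I have in mind; `peel_count` is just a made-up helper name, and it assumes the array is at least naturally 8-byte aligned:

#include <stddef.h>
#include <stdint.h>

/* Sketch: how many leading doubles the pre-loop has to handle before the
   pointer hits an `align`-byte boundary (e.g. 16 for SSE, 32 for AVX).
   Assumes the array is at least naturally 8-byte aligned. */
static size_t peel_count(const double *p, size_t align)
{
    uintptr_t misalign = (uintptr_t)p & (align - 1);  /* bytes past the last boundary */
    return misalign ? (align - misalign) / sizeof(double) : 0;
}

The pre-loop would then run for min(peel_count(in, 32), size) iterations, the vectorized loop for the following multiple of the SIMD width, and the post-loop for whatever is left.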

In practice I will try out the compiler pragmas and flags to see what the compiler produces. If no one gives a good answer, I will post my solution (if any) over the weekend.

  • Have you looked at the code an auto-vectorizing compiler like `gcc` or `clang` will produce to deal with possible misalignment? I suspect you don't quite know what you're getting yourself into. – EOF May 30 '16 at 10:34
  • Yes, decaying the array to a pointer and doing arithmetic on it will help you determine if it's on a nicely aligned address or not. *But...* Checking that won't help. If the address of the first element is not nicely aligned, then the addresses of *all* elements may be unaligned, or at least many or most of the elements. A better idea might be to use dynamic allocation and add an initial padding to make sure you have an aligned "array". – Some programmer dude May 30 '16 at 10:35
  • @JoachimPileborg: I thought the question was about data misaligned for vector-instructions, in which case the individual *elements* of the vector *are* aligned. – EOF May 30 '16 at 10:38
  • @EOF you made some good hints; yes, auto-vectorization might be a good start, but as far as I recall, gcc and clang produce unaligned loads and stores. I will double-check it. – hr0m May 30 '16 at 11:17
  • @JoachimPileborg shouldn't the array be aligned at least somewhere? I am getting more and more confused. – hr0m May 30 '16 at 11:18
  • The `float` type is typically 32 bits, and `double` is 64 bits, i.e. 4 and 8 bytes respectively. Now if the first element is on an odd address, *all* elements will be on odd addresses. Also consider e.g. an array of `float` starting at address 0xyy02, then the next few elements will be on addresses 0xyy06, 0xyy0a, etc., they are also not aligned. – Some programmer dude May 30 '16 at 11:28
  • @JoachimPileborg - the situation you describe wouldn't naturally occur anyway, as arrays are aligned according to the underlying data types - so `double` arrays will be 8-byte aligned anyway. Thus, as per EOF's comment, it is really about the layout of elements within each individual SIMD instruction (be it 128-, 256-, or 512-bit). – Smeeheey May 30 '16 at 11:41
  • @Smeeheey - so the point, I take it, is that "alignment" for vectorization means aligning with the instructions' preferred alignment, not just the natural alignment for the type being processed. – Pete Becker May 30 '16 at 12:19
  • That is my understanding of the question. Note that "preferred" should read "required" in many cases - lots of AVX instructions segfault if you try to use unaligned memory. – Smeeheey May 30 '16 at 13:09
  • Yep, that's what I am asking; @Smeeheey is completely right. My doubles are aligned in memory, but not aligned enough for the SSE/AVX instructions. I have found a page from Intel [1] where in figure 7 they use something called GOOD_ALLIGN. It's basically what I want, however I don't quite understand it. [1] https://software.intel.com/en-us/articles/program-optimization-through-loop-vectorization – hr0m May 30 '16 at 13:11
  • http://stackoverflow.com/a/33733513/2542702 – Z boson Jun 03 '16 at 14:58
  • @Zboson that's all nice, but I don't want to tell the compiler whether it is aligned or not. I need to find out and do it myself. – hr0m Jun 03 '16 at 20:04
  • clang uses unaligned loads/stores. gcc generates unrolled prologue/epilogue code to align pointers. However, since your code has two pointers, and you haven't used `__builtin_assume_aligned` or anything, gcc has to assume that either or both pointers are misaligned. I forget whether gcc prefers to align the input or the output pointer. If the output is aligned and the input isn't, there's no way to do aligned loads and aligned stores (other than `palignr`, which IIRC isn't worth it on Nehalem and later, where unaligned stores are only slower when they cross cache lines). – Peter Cordes Jun 04 '16 at 15:11

1 Answer


Here is some example C code that does what you want:

#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17;
    int c = 1;
    // allocate a 32-byte aligned buffer, then offset it by one double
    // so that p1 is deliberately *not* 32-byte aligned
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    // p2: p1 rounded up to the next 32-byte boundary (start of the aligned part)
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN);
    // p3: p1+n rounded down to a 32-byte boundary (end of the aligned part)
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);
    if(p2>p3) p2 = p3;  // array too small for any aligned chunk

    printf("%p %p %p %p\n", p1, p2, p3, p1+n);
    double *t;
    for(t=p1; t<p2; t+=1) {          // scalar pre-loop up to the alignment boundary
        printf("a %p %f\n", t, *t);
    }
    puts("");
    for(;t<p3; t+=SIMD_WIDTH) {      // aligned SIMD-width chunks
        printf("b %p ", t);
        for(int i=0; i<SIMD_WIDTH; i++) printf("%f ", *(t+i));
        puts("");
    }
    puts("");
    for(;t<p1+n; t+=1) {             // scalar post-loop for the remainder
        printf("c %p %f\n", t, *t);
    }
    _mm_free(p);
}

This generates a 32-byte aligned buffer and then offsets it by one double so it is no longer 32-byte aligned. The first loop runs over scalar values until it reaches 32-byte alignment, the second loops over the 32-byte aligned values, and the last finishes with another scalar loop for any remaining values that are not a multiple of the SIMD width.
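
As a sketch of how these loops would map to actual AVX intrinsics rather than printfs, something like the following could be used. The operation out[i] = 2*in[i] and the name `scale2` are only placeholders; the stores are kept aligned, and the loads stay unaligned because `in` and `out` may not have the same alignment (see the comments below).

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

/* Sketch: out[i] = 2.0*in[i] (placeholder operation) with aligned stores.
   d1/d2/d3 play the same role as p1/p2/p3 above, but for `out`; loads from
   `in` stay unaligned since its alignment may differ from `out`'s. */
void scale2(double *out, const double *in, size_t n)
{
    double *d1 = out;
    double *d2 = (double*)((uintptr_t)(d1 + SIMD_WIDTH - 1) & -(uintptr_t)ALIGN);
    double *d3 = (double*)((uintptr_t)(d1 + n) & -(uintptr_t)ALIGN);
    if (d2 > d3) d2 = d3;

    size_t i = 0;
    for (; d1 + i < d2; i++)              /* scalar pre-loop until `out` is 32-byte aligned */
        out[i] = 2.0 * in[i];
    for (; d1 + i < d3; i += SIMD_WIDTH)  /* aligned stores, unaligned loads */
        _mm256_store_pd(out + i,
                        _mm256_mul_pd(_mm256_set1_pd(2.0), _mm256_loadu_pd(in + i)));
    for (; i < n; i++)                    /* scalar post-loop for the tail */
        out[i] = 2.0 * in[i];
}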


I would argue that this kind of optimization only really makes sense for Intel x86 processors before Nehalem. Since Nehalem, the latency and throughput of unaligned loads and stores are the same as for aligned loads and stores. Additionally, since Nehalem the cost of cache-line splits is small.

There is one subtle point with SSE: unaligned loads and stores cannot be folded into other operations as memory operands, so aligned loads and stores are not obsolete with SSE even since Nehalem. In principle this optimization could therefore make a difference even on Nehalem and later, but in practice I think there are few cases where it will.

However, with AVX unaligned loads and stores can be folded, so the dedicated aligned load and store instructions are obsolete.
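
To make the folding point concrete, here is a small sketch. The exact instructions are up to the compiler, but with AVX the unaligned load may be folded into the memory operand of the arithmetic instruction, which the legacy SSE encoding does not allow.

#include <immintrin.h>

/* With AVX the compiler may emit something like  vaddpd ymm0, ymm0, [mem]
   even though `p` is not known to be 32-byte aligned.  The legacy SSE form
   addpd xmm0, [mem] requires 16-byte alignment, so pre-AVX an unaligned
   load has to stay a separate movups and cannot be folded. */
__m256d add4(__m256d acc, const double *p)
{
    return _mm256_add_pd(acc, _mm256_loadu_pd(p));
}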


I looked into this with GCC, MSVC, and Clang. If GCC cannot assume a pointer is aligned to, e.g., 16 bytes with SSE, it will generate code similar to the code above to reach 16-byte alignment and avoid cache-line splits when vectorizing.

Clang and MSVC don't do this, so they do suffer from the cache-line splits. However, the cost of the additional alignment code roughly offsets the cost of the cache-line splits, which probably explains why Clang and MSVC don't worry about it.

The only exception is before Nehalem. In that case GCC is much faster than Clang and MSVC when the pointer is not aligned. If the pointer is aligned and Clang knows it, it will use aligned loads and stores and be fast like GCC. MSVC's vectorization still uses unaligned loads and stores and is therefore slow pre-Nehalem even when a pointer is 16-byte aligned.
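
For reference, the hint mentioned in the comments looks roughly like this; `scale` is just a made-up example function, and this only applies if the caller really can guarantee 32-byte alignment:

void scale(double * restrict out, const double * restrict in, int n)
{
    /* GCC/Clang builtin: promise that both pointers are 32-byte aligned,
       so the auto-vectorizer can use aligned loads/stores directly and
       skip generating the peeling prologue. */
    out = __builtin_assume_aligned(out, 32);
    in  = __builtin_assume_aligned(in, 32);
    for (int i = 0; i < n; i++)
        out[i] = 2.0 * in[i];
}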


Here is a version which I think is a bit clearer, using pointer differences:

#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>

#define ALIGN 32
#define SIMD_WIDTH (ALIGN/sizeof(double))

int main(void) {
    int n = 17, c = 1;

    // aligned buffer, deliberately offset by one double so p1 is misaligned
    double* p = _mm_malloc((n+c) * sizeof *p, ALIGN);
    double* p1 = p+c;
    for(int i=0; i<n; i++) p1[i] = 1.0*i;
    double* p2 = (double*)((uintptr_t)(p1+SIMD_WIDTH-1)&-ALIGN); // round p1 up to 32 bytes
    double* p3 = (double*)((uintptr_t)(p1+n)&-ALIGN);            // round the end down to 32 bytes
    int n1 = p2-p1, n2 = p3-p2;  // n1: scalar elements before the aligned part, n2: aligned elements
    if(n1>n2) n1=n2;             // array too small for any aligned chunk
    printf("%d %d %d\n", n1, n2, n);

    int i;
    for(i=0; i<n1; i++) {             // scalar pre-loop
        printf("a %p %f\n", &p1[i], p1[i]);
    }
    puts("");
    // i indexes from p1; since n2 is a multiple of SIMD_WIDTH and n1 < SIMD_WIDTH,
    // this loop covers exactly the aligned elements [n1, n1+n2)
    for(;i<n2; i+=SIMD_WIDTH) {
        printf("b %p ", &p1[i]);
        for(int j=0; j<SIMD_WIDTH; j++) printf("%f ", p1[i+j]);
        puts("");
    }
    puts("");
    for(;i<n; i++) {                  // scalar post-loop
        printf("c %p %f\n", &p1[i], p1[i]);
    }
    _mm_free(p);
}
  • I will definitely try that out; it seems very reasonable to me. One remark towards your optimization discussion: AFAIK you need aligned loads/stores if you want to use the streaming instructions, e.g. non-temporal stores (see the sketch after these comments). – hr0m Jun 04 '16 at 15:09
  • @hr0m: That's correct, NT stores are only available in aligned versions. Usually you don't want those, because they forcibly evict the destination from cache. This is bad if you're going to reuse the data later. NT loads don't do anything different on normal (writeback) memory, on current microarchitectures. On current CPUs, NT loads are only useful when loading from WC memory (e.g. video RAM). – Peter Cordes Jun 04 '16 at 15:14
  • You didn't mention the major complication for the OP: if the input and output have different alignments, the only way to do aligned loads and stores is `palignr`. (Not worth it on Nehalem and later). – Peter Cordes Jun 04 '16 at 15:16
  • @PeterCordes: You are not completely right about NT stores. Usually your cache is not large enough to keep the whole array in. I am studying in the simulation field, and large arrays are the normal condition. So if you are memory-bound, you can probably make use of NT stores. But thanks for the thought about the different alignment of the two arrays. – hr0m Jun 04 '16 at 17:19
  • @hr0m: have you looked into cache-blocking, aka tiling, where you do a bunch of work on a subset of your data small enough to fit in cache, before moving on to the next chunk? If the dependencies between different parts of your data are predictable, this can be a better option than just being memory-bound (esp. if you can multi-thread, taking advantage of the private L2 cache in each core). Obviously it takes more work to implement an algorithm that way, but it's the key to getting a matmul to saturate the execution units instead of the memory bus, for example. – Peter Cordes Jun 04 '16 at 19:01
  • @PeterCordes, you're totally right, I overlooked `out`. But I'm not sure it matters now because it appears the OP is interested in non-temporal stores. In that case `in` can be read unaligned and `out` can be aligned as I explained in my answer and use temporal-stores. – Z boson Jun 04 '16 at 19:21
  • @hr0m, you should have stated that you were interested in non-temporal stores. I went into a lot of detail about using them [here](http://stackoverflow.com/a/26256216/2542702). It appears to me that you don't need aligned loads, only aligned stores (or rather non-temporal stores, which must be aligned), in which case the solution in my answer should work fine for you. You would likely increase the bandwidth using multiple threads as well. Keep in mind that since Ivy Bridge it's better to use `enhanced rep movsb` than non-temporal stores. – Z boson Jun 04 '16 at 19:26
  • @Zboson: That's only true if you have one big copy. This isn't memcpy, this is how to store results from AVX calculations. Or do you mean it's actually worth it to store into a 1k scratch buffer (for example) and memcpy that to the final destination? That seems unlikely compared to just using aligned NT stores, unless you will sometimes re-read the data before it's evicted from L3. You definitely don't want the startup overhead of `rep movsb` in 16B or 32B chunks, because it's still non-negligible. – Peter Cordes Jun 04 '16 at 19:31
  • @PeterCordes: Yes, I know about cache blocking. There are even better techniques than that. I have reasons to use NT stores. This is a toy example (even Jacobi is). What I will deal with is a set of different algorithms, and I can't apply cache blocking everywhere. – hr0m Jun 04 '16 at 19:33
  • @PeterCordes, I mean a big copy. I made that clear in my answer [here](http://stackoverflow.com/a/26256216/2542702). Agner said the rule of thumb is larger than half the size of last level cache. So for a 8MB L3 cache non-temporal stores may make sense to use for sizes above 4MB. – Z boson Jun 04 '16 at 19:33
  • @Zboson: Thanks for the link. And also the architecture hints – hr0m Jun 04 '16 at 19:34
  • @hr0m, can you confirm that only the store needs to be aligned because otherwise I have to update my answer with something about `palignr`? – Z boson Jun 04 '16 at 19:36
  • @Zboson: yes, if Nehalem is truly that fast with unaligned loads, then I am only interested in the out array being aligned. However, please (TO ALL) stop wasting time here; my current task is much wider. – hr0m Jun 04 '16 at 19:39
  • If I have more "fun" questions about this subject, I will send you a link, since you guys seem to know a lot, ok? But this discussion is really leading nowhere, since my question was quite specific compared to what I am dealing with. – hr0m Jun 04 '16 at 19:41
  • @hr0m: If you have the choice between unaligned loads or unaligned stores, I forget which is best on which CPU in that case. Since you don't have that choice, you definitely want unaligned loads when you aren't sure that dest alignment == src alignment. When the src does happen to be aligned at runtime, `vmovups` instructions (and memory operands folded into AVX instructions) behave identically to `vmovaps`. (This is the major downside to load-and-shuffle: You always pay the price even when you don't need to. Since the hardware fallback is good enough, this is the way to go.) – Peter Cordes Jun 04 '16 at 19:48
  • You would need to do aligned loads if you were using NT loads, but [`movntdqa` loads probably don't do anything special on normal WB memory](http://stackoverflow.com/questions/32103968/non-temporal-loads-and-the-hardware-prefetcher-do-they-work-together). It's *not* weakly ordered when used on WB memory, according to [the info I dug up for the "NT loads" section of this answer](http://stackoverflow.com/questions/35516878/acquire-release-semantics-with-non-temporal-stores-on-x64), so I think it can't really behave much differently. – Peter Cordes Jun 04 '16 at 19:54
  • @hr0m: If your src data is (almost) always 16B-aligned, but often not 32B-aligned, it might make sense to do your loads in two halves. This is gcc's default strategy for unaligned 256b loads/stores with `-mtune=sandybridge` (or IvB), but not Haswell: `vmovups xmm0, [mem]` / `vinsertf128 ymm0, [mem], 1`. If you're bottlenecked on memory bandwidth for stores anyway, this might be good to remove occasional pipeline bubbles from a page-split load. (Page-split loads are still slow until Skylake: ~100c in Haswell vs. 5c in SKL.) – Peter Cordes Jun 04 '16 at 19:58
  • @PeterCordes, I think I misunderstood your point. You mean that `enhanced rep mov` is only useful for a big memcpy, and the OP is not doing that, so for a big array non-temporal stores are still the right choice (assuming the OP's algorithm is memory-bandwidth bound). I have never used `enhanced rep mov`, only read about it. – Z boson Jun 07 '16 at 07:12
  • @Zboson: exactly: The OP has work to do on the data, not just memcpy. ERMSB is not so much better than MOVNT that it's worth doing regular stores to a 4k bounce buffer and doing `rep movsb` from there to the final destination. (vs. just doing NT stores to the final destination). If the data might sometimes be small enough for L3 hits when it's next read, then maybe, but otherwise not. (Or maybe L4 eDRAM on Broadwell? Skylake L4 is a memory-side cache, so doesn't need to be flushed, even by movnt). – Peter Cordes Jun 07 '16 at 17:01
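
A minimal sketch of the non-temporal variant discussed in these comments, assuming the destination has already been brought to 32-byte alignment by a scalar pre-loop like the one in the answer and that the count is a multiple of the SIMD width; the function name and the out[i] = 2*in[i] operation are placeholders:

#include <immintrin.h>
#include <stddef.h>

/* NT stores only exist in an aligned form, so `dst` must be 32-byte aligned
   here; the loads from `src` may be unaligned.  `n` is assumed to be a
   multiple of 4 doubles for this sketch. */
static void copy_scaled_nt(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i += 4)
        _mm256_stream_pd(dst + i,
                         _mm256_mul_pd(_mm256_set1_pd(2.0),
                                       _mm256_loadu_pd(src + i)));
    _mm_sfence();  /* order the weakly-ordered NT stores before later stores */
}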