I wrote some code with static arrays and it vectorizes just fine.

float data[1024] __attribute__((aligned(16)));

I would like to make the arrays dynamically allocated. I tried doing something like this:

float *data = (float*) aligned_alloc(16, size*sizeof(float));

But the compiler (GCC 4.9.2) can no longer vectorize the code. I assume this is because it doesn't know that the pointer data is 16-byte aligned. I am getting messages like:

note: Unknown alignment for access: *_43

I have tried adding this line before the data is used, but it doesn't seem to do anything:

data = (float*) __builtin_assume_aligned(data, 16);

Using a different variable and restrict did not help:

float* __restrict__ align_data = (float*) __builtin_assume_aligned(data,16);

Example:

#include <iostream>
#include <stdlib.h>
#include <math.h>

#define SIZE 1024
#define DYNAMIC 0
#define A16 __attribute__((aligned(16)))
#define DA16 (float*) aligned_alloc(16, size*sizeof(float))

class Test{
public:
    int size;
#if DYNAMIC
    float *pos;
    float *vel;
    float *alpha;
    float *k_inv;
    float *osc_sin;
    float *osc_cos;
    float *dosc1;
    float *dosc2;
#else
    float pos[SIZE] A16;
    float vel[SIZE] A16;
    float alpha[SIZE] A16;
    float k_inv[SIZE] A16;
    float osc_sin[SIZE] A16;
    float osc_cos[SIZE] A16;
    float dosc1[SIZE] A16;
    float dosc2[SIZE] A16;
#endif
    Test(int arr_size){
        size = arr_size;
#if DYNAMIC
        pos = DA16;
        vel = DA16;
        alpha = DA16;
        k_inv = DA16;
        osc_sin = DA16;
        osc_cos = DA16;
        dosc1 = DA16;
        dosc2 = DA16;
#endif
    }
    void compute(){
        for (int i=0; i<size; i++){
            float lambda = .67891*k_inv[i],
                omega = (.89 - 2*alpha[i]*lambda)*k_inv[i],
                diff2 = pos[i] - omega,
                diff1 = vel[i] - lambda + alpha[i]*diff2;
            pos[i] = osc_sin[i]*diff1 + osc_cos[i]*diff2 + lambda*.008 + omega;
            vel[i] = dosc1[i]*diff1 - dosc2[i]*diff2 + lambda;
        }
    }
};

int main(int argc, char** argv){
    Test t(SIZE);
    t.compute();
    std::cout << t.pos[10] << std::endl;
    std::cout << t.vel[10] << std::endl;
}

Here is how I am compiling:

g++ -o test test.cpp -O3 -march=native -ffast-math -fopt-info-optimized

When DYNAMIC is set to 0, it outputs:

test.cpp:46:4: note: loop vectorized

but when it is set to 1 it outputs nothing.

Azmisov
  • Have you tried using new? – cup Jun 17 '15 at 01:16
  • I haven't yet; I didn't think it could guarantee 16 byte alignment. – Azmisov Jun 17 '15 at 01:26
  • `new float[size]` doesn't give vectorized code and gcc still has `Unknown alignment...` errors – Azmisov Jun 17 '15 at 01:31
  • @Azmisov - what code are you using that is expected to be vectorized? Also, did you try assigning it to a different variable? i.e. `float *alignedData = __builtin_assume_aligned(data, 16);` – EboMike Jun 17 '15 at 01:31
  • Btw, I'm saying that just based on this here: http://locklessinc.com/articles/vectorize/ – EboMike Jun 17 '15 at 01:32
  • @Azmisov How about the std::align in the post mentioned above? – J Trana Jun 17 '15 at 01:33
  • @EboMike That doesn't seem to help. The code is a little long, but I can try and throw together a minimized version, if I can. – Azmisov Jun 17 '15 at 01:40
  • @JTrana Apparently gcc doesn't have `std::align` implemented yet, as I can't get it to compile. – Azmisov Jun 17 '15 at 01:58
  • Maybe I should upgrade to GCC 5.1 – Azmisov Jun 17 '15 at 02:01
  • You may try platform-specific allocation `http://linux.die.net/man/3/posix_memalign` – Severin Pappadeux Jun 17 '15 at 02:57
  • @Azmisov can you post the loop you are using this? According to the link EboMike said, it is bad to make two loops while trying to do a vectorization. – mr5 Jun 17 '15 at 03:42
  • @Azmisov also try adding the keyword `restrict` to your pointers. It might help as well. – mr5 Jun 17 '15 at 03:44
  • I think you're on the track, you need to put the `__attribute__((aligned(16)))` on the pointer type, just like it was on the array. Though I would use a typedef or even a struct for this. – o11c Jun 17 '15 at 03:44
  • Can you write the additions you make to the question in the question? It is annoying to gather this from comments. – harper Jun 17 '15 at 04:56
  • `__builtin_assume_aligned` [has worked for me in the past along with restrict](https://stackoverflow.com/questions/23651055/sum-of-overlapping-arrays-auto-vectorization-and-restrict). Post a minimal working example so we can see what is happening. – Z boson Jun 17 '15 at 07:48
  • It's strange that the compiler would no longer vectorize the code because of alignment. Even if it's not aligned it would still vectorize but put in additional code to correct for the misalignment. Maybe it has something to do with one being a pointer and one being an array (the compiler knows the array has 1024 elements but not the size of the memory the pointer references). – Z boson Jun 17 '15 at 07:57
  • I usually use `_mm_malloc` because it works for GCC, ICC, MSVC, and MinGW. – Z boson Jun 17 '15 at 11:21
  • I added an example. I haven't tried some of the suggestions yet, so I'll keep fiddling with it. – Azmisov Jun 17 '15 at 18:11
  • vectorizing `atan2` is hard. In the static version, `phase` is eliminated since it is never read. – Marc Glisse Jun 17 '15 at 19:03
  • @MarcGlisse You're right; originally, when I was compiling, it was optimizing it to some SSE optimized `atan2`. I updated the example and removed the `sqrt`/`atan2` calculations, but it still isn't vectorizing. – Azmisov Jun 17 '15 at 19:38
  • Try renaming `main` to `some_function` and the code now vectorizes ;-) The reason is that `main` is only ever called once, so it is marked as "cold" by the compiler, which disables some optimizations. – Marc Glisse Jun 18 '15 at 05:36

1 Answer

The compiler isn't vectorizing the loop because it can't determine that the dynamically allocated pointers don't alias each other. A simple way to allow your sample code to be vectorized is to pass the --param vect-max-version-for-alias-checks=1000 option. This will allow the compiler to emit all the checks necessary to see if the pointers are actually aliased.

Another simple solution that allows your example code to be vectorized is to rename main, as suggested by Marc Glisse in his comment. Functions named main apparently have certain optimizations disabled. When the function is named something else, GCC 4.9.2 can track the use of this->pos (and the other pointer members) in compute back to their allocations in Test().

However, I assume that in your real code something other than the class being used in a function named main prevented vectorization. A more general solution that allows your code to be vectorized without aliasing or alignment checks is to use the restrict keyword and the aligned attribute. Something like this:

typedef float __attribute__((aligned(16))) float_a16;

__attribute__((noinline))
static void _compute(float_a16 * __restrict__ pos,
         float_a16 * __restrict__ vel,
         float_a16 * __restrict__ alpha,
         float_a16 * __restrict__ k_inv,
         float_a16 * __restrict__ osc_sin,
         float_a16 * __restrict__ osc_cos,
         float_a16 * __restrict__ dosc1,
         float_a16 * __restrict__ dosc2,
         int size) {
    for (int i=0; i<size; i++){
        float lambda = .67891*k_inv[i],
            omega = (.89 - 2*alpha[i]*lambda)*k_inv[i],
            diff2 = pos[i] - omega,
            diff1 = vel[i] - lambda + alpha[i]*diff2;
        pos[i] = osc_sin[i]*diff1 + osc_cos[i]*diff2 + lambda*.008 + omega;
        vel[i] = dosc1[i]*diff1 - dosc2[i]*diff2 + lambda;
    }
}

void compute() {
    _compute(pos, vel, alpha, k_inv, osc_sin, osc_cos, dosc1, dosc2,
         size);
}

The noinline attribute is critical; otherwise, inlining can cause the pointers to lose their restrictedness and alignedness. The compiler seems to ignore the restrict keyword in contexts other than function parameters.

Ross Ridge
  • Ah, very nice. Is there any way to use the `restrict` keyword without passing everything into a separate function? It seems cumbersome to have to do that for every sse function. – Azmisov Jun 18 '15 at 02:34
  • @Azmisov Unfortunately, this is the only way I found that worked. – Ross Ridge Jun 18 '15 at 03:27
  • I believe the handling of restrict with inlining was improved in gcc-5 and the functions aligned_alloc and posix_memalign were marked like malloc as returning disjoint memory regions, but that compiler simply crashes on this testcase :-( – Marc Glisse Jun 18 '15 at 05:34
  • @MarcGlisse, do you have a reference which discusses malloc return disjoint memory regions. That's interesting. – Z boson Jun 18 '15 at 07:52
  • @Azmisov, another option is to explicitly vectorize your code. Then you don't have to worry about restrict. I use the [Vector Class Library](http://agner.org/optimize/#vectorclass) and am interested in [Yeppp!](http://www.yeppp.info/index.html) as well. – Z boson Jun 18 '15 at 07:56
  • @Zboson If a function is marked with `__attribute__((malloc))` then GCC can assume that the pointer returned doesn't refer to the storage of any other object. (It can also assume the memory is uninitialized.) In the case of malloc, the program would have to invoke undefined behaviour for that not to be true. GCC could take advantage of this here if it can connect the use of `this->pos` in `compute` (and `vel`, `alpha`, etc.) to the allocation of `this->pos` in `Test()` without `this` escaping in between. You can get this to work in the example by renaming main. – Ross Ridge Jun 18 '15 at 15:22
  • The attribute is documented here: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-g_t_0040code_007bmalloc_007d-function-attribute-3096 – Ross Ridge Jun 18 '15 at 15:26