How to vectorize my loop with g++?

Question

The introductory links I found while searching:

As you can see, most of them are for C, but I thought they might work in C++ as well. Here is my code:

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

template<typename T>
//__attribute__((optimize("unroll-loops")))
//__attribute__ ((pure))
void foo(std::vector<T> &p1, size_t start,
            size_t end, const std::vector<T> &p2) {
  typename std::vector<T>::const_iterator it2 = p2.begin();
  //#pragma simd
  //#pragma omp parallel for
  //#pragma GCC ivdep Unroll Vector
  for (size_t i = start; i < end; ++i, ++it2) {
    p1[i] = p1[i] - *it2;
    p1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x,y;
    n = 12800000;
    std::vector<double> v,u;
    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }
    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(v,0,n,u);
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
    return 0;
}

I used all the hints that appear commented out above, but I did not get any speedup, as the sample output shows (the first run had #pragma GCC ivdep Unroll Vector uncommented):

samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -funroll-loops -ftree-vectorize -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.026575 seconds.
samaras@samaras-A15:~/Downloads$ g++ test.cpp -O3 -std=c++0x -o test
samaras@samaras-A15:~/Downloads$ ./test
It took me 0.0252697 seconds.

Is there any hope? Or does the -O3 optimization flag already do the trick? Any suggestions for speeding up this code (the foo function) are welcome!

My version of g++:

samaras@samaras-A15:~/Downloads$ g++ --version
g++ (Ubuntu 4.8.1-2ubuntu1~12.04) 4.8.1

Note that the body of the loop is arbitrary. I am not interested in rewriting it in some other form.

EDIT

An answer saying that there is nothing more that can be done is also acceptable!

So did you look at the assembly to see if it's already vectorized under `-O3`? — Mysticial, Mar 27 '15 at 03:45
Oh damn, no I did not. I am going to do so, by checking this question: http://stackoverflow.com/questions/1289881/using-gcc-to-produce-readable-assembly Good idea @Mysticial! — gsamaras, Mar 27 '15 at 03:48
@Mysticial maybe the answer given by David makes the reading of the assembly not needed? — gsamaras, Mar 27 '15 at 03:49
I'm not sure if the compiler is even allowed to vectorize that loop. How does it know that `p1` and `p2` do not alias? — Mysticial, Mar 27 '15 at 03:52
By not alias you mean that they are surely different? There is the `ivdep` hint that one of the links I posted describes, but I am not sure if that answers your question @Mysticial. — gsamaras, Mar 27 '15 at 03:54
I tried to read the assembly, but I am getting really different results from the link I posted in the comments. The output is too big. — gsamaras, Mar 27 '15 at 03:57

score 15 · Accepted Answer · edited Sep 06 '21 at 14:57

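The -O3 flag turns on -ftree-vectorize automatically (that flag enables -ftree-loop-vectorize and -ftree-slp-vectorize, both of which appear in the list below). From https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html: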

-O3 turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-loop-vectorize, -ftree-loop-distribute-patterns, -ftree-slp-vectorize, -fvect-cost-model, -ftree-partial-pre and -fipa-cp-clone options

So in both cases the compiler is trying to do loop vectorization.

Using g++ 4.8.2 to compile with:

# In newer versions of GCC use -fopt-info-vec (or -fopt-info-vec-missed for loops that were not vectorized) instead of -ftree-vectorizer-verbose=1
g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorize -ftree-vectorizer-verbose=1 -o test

Gives this:

Analyzing loop at test.cpp:16

Vectorizing loop at test.cpp:16

test.cpp:16: note: create runtime check for data references *it2$_M_current_106 and *_39
test.cpp:16: note: created 1 versioning for alias checks.

test.cpp:16: note: LOOP VECTORIZED.
Analyzing loop at test_old.cpp:29

test.cpp:22: note: vectorized 1 loops in function.

test.cpp:18: note: Unroll loop 7 times

test.cpp:16: note: Unroll loop 7 times

test.cpp:28: note: Unroll loop 1 times

Compiling without the -ftree-vectorize flag:

g++ test.cpp -O2 -std=c++0x -funroll-loops -ftree-vectorizer-verbose=1 -o test

Returns only this:

test_old.cpp:16: note: Unroll loop 7 times

test_old.cpp:28: note: Unroll loop 1 times

Line 16 is the start of the loop in foo, so the compiler is definitely vectorizing it. Checking the assembly confirms this too.
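
The "created 1 versioning for alias checks" note also addresses the aliasing concern from the comments: GCC keeps two copies of the loop and picks one at run time, depending on whether the two ranges overlap. Here is a rough hand-written equivalent, as a sketch only. The names ranges_overlap and foo_versioned are hypothetical, and the code GCC actually generates differs in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if [a, a+n) and [b, b+n) share at least one byte.
static bool ranges_overlap(const double* a, const double* b, std::size_t n) {
    const std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
    const std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
    const std::uintptr_t bytes = n * sizeof(double);
    return pa < pb + bytes && pb < pa + bytes;
}

// Hypothetical illustration of loop versioning for the loop in foo.
void foo_versioned(double* p1, const double* p2, std::size_t n) {
    assert(n == 0 || (p1 != nullptr && p2 != nullptr));
    if (!ranges_overlap(p1, p2, n)) {
        // Fast version: the ranges are disjoint, so iterations are
        // independent and the compiler is free to use SIMD here.
        double* __restrict__ a = p1;
        const double* __restrict__ b = p2;
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = a[i] - b[i];
            a[i] += 1;
        }
    } else {
        // Fallback version: keep the original element-by-element order,
        // because a write to p1[i] may change a later read through p2.
        for (std::size_t i = 0; i < n; ++i) {
            p1[i] = p1[i] - p2[i];
            p1[i] += 1;
        }
    }
}
```

With two separate std::vector objects, as in the question, the check always selects the fast version, so its cost is one comparison per call.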

I seem to be getting some aggressive caching on the laptop I'm currently using which is making it very hard to accurately measure how long the function takes to run.

But here's a couple of other things you can try too:

Use the __restrict__ qualifier to tell the compiler that there is no overlap between the arrays.
Tell the compiler the arrays are aligned with __builtin_assume_aligned (not portable)

Here's my resulting code (I removed the template since you will want to use different alignment for different data types)

#include <iostream>
#include <chrono>
#include <vector>

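void foo( double * __restrict__ p1,
          const double * __restrict__ p2,
          size_t start,
          size_t end )
{
  // Promise the compiler that both pointers are 16-byte aligned (the width of
  // one SSE register). The behavior is undefined if they are not.
  double* pA1 = static_cast<double*>(__builtin_assume_aligned(p1, 16));
  const double* pA2 = static_cast<const double*>(__builtin_assume_aligned(p2, 16));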

  for (size_t i = start; i < end; ++i)
  {
      pA1[i] = pA1[i] - pA2[i];
      pA1[i] += 1;
  }
}

int main()
{
    size_t n;
    double x, y;
    n = 12800000;
    std::vector<double> v,u;

    for(size_t i=0; i<n; ++i) {
        x = i;
        y = i - 1;
        v.push_back(x);
        u.push_back(y);
    }

    using namespace std::chrono;

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    foo(&v[0], &u[0], 0, n );
    high_resolution_clock::time_point t2 = high_resolution_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);

    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    return 0;
}

Like I said, I've had trouble getting consistent time measurements, so I can't confirm whether this will give you a performance increase (or maybe even a decrease!).
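
One common way to reduce timing noise is to run the function several times and keep the fastest run, since a single measurement also pays for first-touch page faults and cold caches. A minimal sketch follows. The helper name best_of is hypothetical, and std::chrono::steady_clock is swapped in for high_resolution_clock because it is guaranteed to be monotonic.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: call `work` `runs` times and return the shortest
// wall-clock duration of a single call, in seconds.
template <typename F>
double best_of(int runs, F work) {
    assert(runs > 0);
    typedef std::chrono::steady_clock clock;
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < runs; ++r) {
        const clock::time_point t1 = clock::now();
        work();
        const clock::time_point t2 = clock::now();
        const double seconds = std::chrono::duration<double>(t2 - t1).count();
        best = std::min(best, seconds);
    }
    return best;
}
```

Assuming the foo, v, u and n from the code above, it could be called as best_of(10, [&] { foo(&v[0], &u[0], 0, n); }). Because foo updates v in place, every run sees different values, but the amount of work per run stays the same.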

No difference! Maybe `-funroll-loops` is already enabled by O2, but I could not confirm it. If you have any other suggestion, use the edit button ( recommended :D ). — gsamaras, Mar 27 '15 at 03:53
Yup I just tried it too, and got no difference, let me try some things and see what I can find :) — David Saxon, Mar 27 '15 at 03:55
If you haven't anything new, I could accept the answer maybe, but you have to let me know! — gsamaras, Mar 29 '15 at 13:29
Sorry I ran out of time, I'm still interested in this though. I'll have more of a look tonight :) Don't accept the current answer yet since I ran that with an older version of gcc without noticing. — David Saxon, Mar 29 '15 at 23:13
What is the `16` constant in your code? So you mean that -ftree-vectorize does not have an effect. Also, how should I compile the code you wrote? No speedup, with the two ways I compiled it. :/ +1 though for your good try! — gsamaras, Mar 30 '15 at 14:58
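
On the `16` in the last comment: the second argument of __builtin_assume_aligned is a byte count. It promises that the pointer value is a multiple of 16 bytes, which is the width of one 128-bit SSE register (two doubles). The compiler does not check the promise, so it may be worth verifying it before calling foo. A small sketch, with a hypothetical helper name is_aligned:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address `p` is a multiple of `alignment`
// bytes. `alignment` must be a nonzero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For example, is_aligned(&v[0], 16) could be asserted once before the timed call. Whether a std::vector<double> buffer is 16-byte aligned depends on the allocator and platform, so treat the 16 as an assumption to check rather than a guarantee.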

score 1 · Answer 2 · answered Mar 30 '15 at 12:39

GCC has vector extensions that add new primitive types which the compiler maps to SIMD instructions. Take a look at the GCC documentation on vector extensions for details.

Most compilers say they will auto-vectorize operations, but this depends on the compiler's pattern matching and, as you can imagine, it can be very hit and miss.

doron


Interesting, but I am not sure what size should I pass in the attribute, can you guide me through? – gsamaras Mar 30 '15 at 15:08
I think many architectures have 128 bit SIMD registers so keep all your objects 128 bit wide. Also remember SIMD does not speed up data load and store times, it just speeds up arithmetic operations. – doron Mar 31 '15 at 13:06
Still it's not clear how I should do it. Am I restricted to 4 dimensions? An example would help. – gsamaras Mar 31 '15 at 14:04
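
Since an example was requested, here is a minimal sketch of the question's loop written with the GCC vector extensions. The size passed to the vector_size attribute is in bytes, so 16 gives a vector of two doubles (or four floats); it does not limit the length of the array, which is simply processed one vector at a time. The names v2df and foo_simd are hypothetical, and this is one possible formulation, not necessarily what the answer had in mind.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GCC/Clang vector extension: a 16-byte vector holding two doubles.
typedef double v2df __attribute__((vector_size(16)));

// Hypothetical rewrite of the loop body: p1[i] = p1[i] - p2[i]; p1[i] += 1;
void foo_simd(double* p1, const double* p2, std::size_t n) {
    assert(n == 0 || (p1 != nullptr && p2 != nullptr));
    const v2df one = {1.0, 1.0};
    std::size_t i = 0;
    // Main loop: two elements per iteration.
    for (; i + 2 <= n; i += 2) {
        v2df a, b;
        std::memcpy(&a, p1 + i, sizeof a);  // load without assuming alignment
        std::memcpy(&b, p2 + i, sizeof b);
        a = a - b;                          // element-wise subtraction
        a = a + one;                        // element-wise addition
        std::memcpy(p1 + i, &a, sizeof a);  // store the two results
    }
    // Tail loop: handle the last element when n is odd.
    for (; i < n; ++i) {
        p1[i] = p1[i] - p2[i];
        p1[i] += 1;
    }
}
```

It would be called as foo_simd(&v[0], &u[0], n). Given that the auto-vectorizer already handles this loop, there is no reason to expect the manual version to be faster.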
