
I am trying to learn more about auto-vectorization in gcc. In my project I have to use gcc 4.8.5, and I have some loops that, as far as I can see, are not vectorized. I have therefore created a small example to experiment with and to find out why they are not.

What I am interested in is why gcc does not vectorize the loop and how I can get it vectorized. Unfortunately I am not very familiar with the output messages of GCC.

a) I would expect that this loop would be vectorized as a trivial case

b) Is there anything trivial that I am missing?

Thank you all very much in advance ...

The small example is:

#include <iostream>
#include <vector>

using namespace std;

class test
{

public:
    test();
    ~test();
    void calc_test();
};

test::test()
{
}

test::~test()
{
}

void
test::calc_test(void)
{
    vector<int> ffs_psd(10000,5.0);
    vector<int> G_qh_sp(10000,1.0);
    vector<int> G_qv_sp(10000,3.0);
    vector<int> B_erm_qh(10000,50.0);
    vector<int> B_erm_qv(10000,2.0);


    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang])  +  (G_qv_sp[ang] * B_erm_qv[ang]);
    }

}

int main(int argc, char * argv[])
{
  test m_test;
  m_test.calc_test();
}

I compile it with gcc 4.8.5 :

c++ -O3 -ftree-vectorize -fopt-info-vec-missed -ftree-vectorizer-verbose=5 -std=c++11 test.cpp

The output that I get from the compiler is:

test.cpp:34: note: ===vect_slp_analyze_bb===

test.cpp:34: note: === vect_analyze_data_refs ===

test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: get vectype with 4 units of type value_type
test.cpp:34: note: vectype: vector(4) int
test.cpp:34: note: === vect_pattern_recog ===
test.cpp:34: note: vect_is_simple_use: operand _27
test.cpp:34: note: def_stmt: _27 = (long unsigned int) ang_212;

test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand ang_212
test.cpp:34: note: def_stmt: ang_212 = PHI <ang_43(78), 0(76)>

test.cpp:34: note: type of def: 2.
test.cpp:34: note: vect_is_simple_use: operand 4
test.cpp:34: note: vect_recog_widen_mult_pattern: detected: 
test.cpp:34: note: get vectype with 4 units of type uint
test.cpp:34: note: vectype: vector(4) unsigned int
test.cpp:34: note: get vectype with 2 units of type long unsigned int
test.cpp:34: note: vectype: vector(2) long unsigned int
test.cpp:34: note: patt_2 = ang_212 w* 4;

test.cpp:34: note: pattern recognized: patt_2 = ang_212 w* 4;

test.cpp:34: note: vect_is_simple_use: operand _29
test.cpp:34: note: def_stmt: _29 = *_67;

test.cpp:34: note: type of def: 3.
test.cpp:34: note: vect_is_simple_use: operand _34
test.cpp:34: note: def_stmt: _34 = *_69;

test.cpp:34: note: type of def: 3.
test.cpp:34: note: === vect_analyze_dependences ===
test.cpp:34: note: can't determine dependence between *_67 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_68 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_69 and MEM[(value_type &)__first_111]
test.cpp:34: note: can't determine dependence between *_70 and MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_refs_alignment ===
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_125
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_153
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_139
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: SLP: step doesn't divide the vector-size.
test.cpp:34: note: Unknown alignment for access: *__first_167
test.cpp:34: note: vect_compute_data_ref_alignment:
test.cpp:34: note: can't force alignment of ref: MEM[(value_type &)__first_111]
test.cpp:34: note: === vect_analyze_data_ref_accesses ===
test.cpp:34: note: not consecutive access MEM[(value_type &)__first_111] = _41;

test.cpp:34: note: === vect_analyze_slp ===
test.cpp:34: note: Failed to SLP the basic block.
test.cpp:34: note: not vectorized: failed to find SLP opportunities in basic block.

EDIT: After Matt's answer below:

@Matt :

Thanks a lot for your answer. I did not know that the vector is not aligned. This information is very useful, because many people would take it for granted that a loop will be vectorized even if they use a vector as a container.

Unfortunately, even with your changes gcc still reports that the loop is not vectorized (with different messages this time):

test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];

test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.

test.cpp:47: note: misalign = 0 bytes of ref MEM[(value_type &)&ffs_psd]
test.cpp:47: note: not consecutive access _25 = MEM[(value_type &)&ffs_psd];

test.cpp:47: note: Failed to SLP the basic block.
test.cpp:47: note: not vectorized: failed to find SLP opportunities in basic block.

The assembly output is (hopefully I copied the correct section, because my assembly knowledge is not very good):

.L16:
vmovdqa 40000(%rsp,%rax), %ymm1
vmovdqa 80000(%rsp,%rax), %ymm0
vpmulld 120000(%rsp,%rax), %ymm1, %ymm1
vpmulld 160000(%rsp,%rax), %ymm0, %ymm0
vpaddd  %ymm0, %ymm1, %ymm0
vpaddd  (%rsp,%rax), %ymm0, %ymm0
vmovdqa %ymm0, (%rsp,%rax)
addq    $32, %rax
cmpq    $27232, %rax
jne .L16
  • By allowing undefined behaviour in your program, the compiler's already not guaranteed to abide by "normal" rules: just as a general rule to ensure you have a proper example, you should fix the UB. – hnefatl Nov 01 '18 at 15:41
  • The Undefined behavior happens only at run time and has nothing to do with the optimization gcc tries to implement. At least to my knowledge... – Ioannis Nestoras Nov 01 '18 at 15:46
  • Nope, UB can in general often be detected at compile-time and affect the choices the compiler makes when generating target code. See eg. example 9 [on this webpage](https://alexpolt.github.io/undefined.html) for a dramatic example. – hnefatl Nov 01 '18 at 15:48
  • The compiler doesn't like the fact that the vectors are empty, but even then, it's not going to vectorize. Newer gcc versions do, so it could be a bug that has been fixed since then. – Matthieu Brucher Nov 01 '18 at 15:49
  • @IoannisNestoras _"The Undefined behavior happens only at run time and has nothing to do with the optimization gcc tries to implement."_ No, quite the opposite. Constructs deemed to have undefined behaviour by the standard result in implementations _assuming that you have not used those constructs_. This can have a huge impact on code generation, _especially_ optimisations, and naturally code generation affects 99% of what happens at runtime. – Lightness Races in Orbit Nov 01 '18 at 15:51
  • You didn't enable any `-march` option, so GCC cannot make a lot of assumptions about what instructions are available on the target machine. An x86_64 processor is only guaranteed to have SSE and SSE2, so newer instructions (for example the SSE4.1 packed 32-bit multiply `pmulld`, or anything from AVX) are not available by default. If you compiled for 32-bit i686, even less is guaranteed. – eerorika Nov 01 '18 at 15:56
  • You were all correct about the undefined behavior. I have changed the code above and the corresponding output. The result is the same as the loop is not vectorized. – Ioannis Nestoras Nov 01 '18 at 15:56
  • Even with the -march option the result is the same. The machine is x86_64 and the result is always a non vectorized loop with any compiler option I use – Ioannis Nestoras Nov 01 '18 at 15:58
  • @IoannisNestoras which `-march` option? There are many to choose from. Not all of them support all SIMD instructions. – eerorika Nov 01 '18 at 15:59
  • I have used -march=native. I even tried -msse, -msse2 and -mavx, but with the same result. – Ioannis Nestoras Nov 01 '18 at 16:02

1 Answer


To use aligned vector loads and stores, the operands need to be aligned on the proper boundaries, for example with __attribute__((aligned(32))) or __attribute__((aligned(16))). The standard allocator for std::vector does not guarantee such alignment, even if the element type itself is declared as aligned. For example, std::vector<__m64> A creates a vector of SIMD data types, but they may not be aligned because std::allocator doesn't align everything. In my opinion the simplest change is to use a std::array with __attribute__((aligned(32))):

#include <iostream>
#include <array>

using namespace std;

int main()
{
    array<int, 10000> ffs_psd __attribute__((aligned(32)));
    ffs_psd.fill(5);
    array<int, 10000> G_qh_sp __attribute__((aligned(32)));
    G_qh_sp.fill(1);
    array<int, 10000> G_qv_sp __attribute__((aligned(32)));
    G_qv_sp.fill(3);
    array<int, 10000> B_erm_qh __attribute__((aligned(32)));
    B_erm_qh.fill(50);
    array<int, 10000> B_erm_qv __attribute__((aligned(32)));
    B_erm_qv.fill(2);


    for ( uint ang=0; ang < 6808; ang++)
    {
        ffs_psd[0] += (G_qh_sp[ang] * B_erm_qh[ang])  +  (G_qv_sp[ang] * B_erm_qv[ang]);      
    }
    cout << ffs_psd[0] << endl;
}

The loop produces this:

vmovdqa ymm2, YMMWORD PTR [rsp+40000+rax]
vmovdqa ymm1, YMMWORD PTR [rsp+80000+rax]
vpmulld ymm2, ymm2, YMMWORD PTR [rsp+120000+rax]
vpmulld ymm1, ymm1, YMMWORD PTR [rsp+160000+rax]
add     rax, 32
vpaddd  ymm1, ymm2, ymm1
cmp     rax, 27232
vpaddd  ymm0, ymm0, ymm1
jne     .L13
vmovdqa xmm1, xmm0

on Godbolt with GCC 4.8.3 -std=c++11 -Wall -Wextra -pedantic-errors -O2 -ftree-vectorize -march=native
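As a side note on the listing above: the vectorized loop accumulates into a register (ymm0) and only combines the result after the loop. A minimal sketch of the same idea with std::vector keeps the sum in a local variable, so the loop body contains no store that could alias the inputs. The helper name weighted_sum is hypothetical (it is not part of the original code), and whether GCC 4.8.5 actually vectorizes this version has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (an assumption, not from the original post):
// returns the sum of a[i]*b[i] + c[i]*d[i] over the first n elements.
int weighted_sum(const std::vector<int>& a, const std::vector<int>& b,
                 const std::vector<int>& c, const std::vector<int>& d,
                 std::size_t n)
{
    // Guard against reading past the end of any input.
    assert(n <= a.size() && n <= b.size() && n <= c.size() && n <= d.size());

    // Raw pointers to the contiguous storage of each vector.
    const int* pa = a.data();
    const int* pb = b.data();
    const int* pc = c.data();
    const int* pd = d.data();

    // Local accumulator: the loop only reads memory, the single
    // write-back is left to the caller.
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += pa[i] * pb[i] + pc[i] * pd[i];
    return sum;
}
```

With the values from the question, the loop would then become ffs_psd[0] += weighted_sum(G_qh_sp, B_erm_qh, G_qv_sp, B_erm_qv, 6808);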

Another option is to use boost::alignment::aligned_allocator with your vector.

Finally you can write your own allocator that vector can use to properly align things. Here is an article explaining the requirements for an allocator. Also here is a SO question about the same basic thing.
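A minimal sketch of such an allocator in C++11 could look like the following. The name AlignedAllocator and the helper demo_is_aligned are hypothetical, and the sketch assumes a POSIX system because it relies on posix_memalign; it is an illustration of the idea, not a tested drop-in for GCC 4.8.5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>   // posix_memalign, free (POSIX; an assumption of this sketch)
#include <vector>

// Hypothetical minimal allocator that returns Alignment-byte aligned memory.
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
    static_assert(Alignment >= sizeof(void*) &&
                  (Alignment & (Alignment - 1)) == 0,
                  "Alignment must be a power of two and at least sizeof(void*)");

    using value_type = T;

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind is required because Alignment is a non-type
    // template parameter, so allocator_traits cannot derive it.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        // Reject requests whose byte size would overflow.
        if (n > static_cast<std::size_t>(-1) / sizeof(T))
            throw std::bad_alloc();
        void* p = nullptr;
        if (::posix_memalign(&p, Alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { ::free(p); }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) noexcept { return false; }

// Usage sketch: true if a 10000-element vector starts on a 32-byte boundary.
inline bool demo_is_aligned() {
    std::vector<int, AlignedAllocator<int, 32>> v(10000, 5);
    assert(v.size() == 10000 && v[0] == 5);
    return reinterpret_cast<std::uintptr_t>(v.data()) % 32 == 0;
}
```

The vectors from the question would then be declared as, for example, vector<int, AlignedAllocator<int, 32>> ffs_psd(10000, 5); the alignment is a property of the allocation, so the compiler may still not be able to assume it at compile time.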

Matt
  • Thanks a lot Matt for the info. Please see above my edit. The problem remains unfortunately even with your changes. – Ioannis Nestoras Nov 02 '18 at 10:55
  • @IoannisNestoras remove all the other functions. I believe the messages you are seeing mean that gcc failed to vectorize the other things. This is expected because there is nothing to vectorize in them. The instructions you pasted appear to be SIMD instructions. – Matt Nov 02 '18 at 16:13
  • Thanks a lot Matt. After reading about the instructions, yes, I see that they are SIMD instructions. But the gcc messages about the vectorization are a bit misleading. I hope this is better in newer versions. Thanks again... – Ioannis Nestoras Nov 05 '18 at 13:02