First of all, I agree with the comment that says you must "manage your memory very close" if you intend to successfully vectorize your loop. In case don't know about that - see footnote on the end of this answer for a very brief and superficial introduction about memory aligment.
However, even if your memory is well aligned, there is another possibility that maybe hold you back. Georg Hager e Gerhard Wellein, authors of the respected book "Introduction to High Performance Computing for Scientists and Engineer", explicit state that C++ operator overloading may prevent loop vectorization
In their own words :
"(....) STL may define this operator in the following way (adapted from the GNU ISO C++ library source):
const T& operator[](size_t __n) const{ return *(this->_M_impl._M_start + __n); }
Although this looks simple enough to be inlined efficiently, current compilers refuse
to apply SIMD vectorization to the summation loop above. A single layer of abstraction, in this case an overloaded index operator, can thus prevent the creation of
optimal loop code."
A good friend convinced me that this is not actually true for stl containers anymore, because compilers can eliminate the layer of indirection associated with operator[]
. But, it seems that you wrote your own container, so you must check if compiler can eliminate the layer of indirection associated with your own operator()
! A good cross check is to provide yourself a way to work directly with the underlying array that your container holds (meaning: write a member function similar to std::vector.data()
and use the C pointers as an "iterator" inside your loop ).
Footnote about memory alignment:
Problem: assume you want to vectorize c[i] = a[i] + b[i]
.
First fact: size(double)
= 8 bytes = 64 bits.
Second fact: There is an assembly instruction that reads 2 doubles in memory and put them on 128 bits register => with one assembly instruction you can read 2 doubles => they can read a[0]
and a[1]
then b[0]
and b[1]
!
Third fact: When you apply the instruction on the register, you make 2 sums of 64 bits double
at the same time.
The problem is that assembly can only read a[0]
and a[1]
at the same time only if a[0]
and b[0]
are in memory addresses that are multiple of 16 (if they are not, he can test if a[1]
and b[1]
is alignment and so forth). That is why memory can be an issue that prevents vectorization. To fix that, you must write container allocators that guarantees that the first element of your container will be written on a memory address that is multiple of 16.
Update: This answer provides a detailed explanation about how to code an allocator that aligns your memory.
Update 2: another useful answer for learning how to code allocators
Update 3: alternative approach using posix_memalign/_aligned_malloc