
I would like GCC to vectorize the code below. -fopt-info tells me that it is not currently being vectorized. I believe the problem is the strided access of W, or possibly the backward incrementing of k. Note that height and width are constants, and index_type is currently set to unsigned long.

I have removed some comments:

114  for (index_type k=height-1;k+1>0;k--) {
116    for (index_type i=0;i<width;i++) {
117      Yp[k*width + i] = 0.0;                                                            
119      for (index_type j=0;j<width;j++) {                                                                            
121        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
122      }
123      Yp[k*width + i] *= DF(Ap[k*width + i]);
124    }
125  }

I am compiling with gcc -O3 -ffast-math -fopt-info -std=c11 ./neural.c -o neural -lm

Is there a good way to make this vectorize? Can you refer me to further information?

Is my method of indexing a bad idea (i.e. the k*width*width + ...)? I need to allocate dynamically, and I thought that keeping things close together in memory would better enable optimizations.
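(For reference, the layout written out as a helper; at3 is just an illustrative name, with W conceptually a height x width x width array stored contiguously:)

static inline value_type at3(const value_type *W, index_type k,
                             index_type j, index_type i,
                             index_type width) {
  /* Flat 3D index: layer k, row j, column i. */
  return W[k*width*width + j*width + i];
}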

EDIT: This might be useful

The -fopt-info-missed output for these lines:

./neural.c:114:3: note: not vectorized: multiple nested loops.
./neural.c:114:3: note: bad loop form.
./neural.c:116:5: note: not vectorized: control flow in loop.
./neural.c:116:5: note: bad loop form.
./neural.c:119:7: note: step unknown.
./neural.c:119:7: note: reduction used in loop.
./neural.c:119:7: note: Unknown def-use cycle pattern.
./neural.c:119:7: note: not vectorized: complicated access pattern.
./neural.c:119:7: note: bad data access.
./neural.c:110:21: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:58: note: not vectorized: not enough data-refs in basic block.
./neural.c:110:62: note: not vectorized: not enough data-refs in basic block.
./neural.c:117:18: note: not vectorized: not enough data-refs in basic block.
./neural.c:114:37: note: not vectorized: not enough data-refs in basic block.

EDIT:

A minimal example is HERE (the same code is reproduced in a later edit below).

I am trying it with BLAS. In the minimal example it goes faster, but on the whole code it is slower ... not sure why.

EDIT:

The compiler was optimizing out code. Fixed. BLAS is now faster. The fix was in the whole code, not the minimal example.

EDIT:

Same code as in the link from the previous edit:

#include <math.h>
#include <cblas.h>
#include <stdlib.h>
#include <stdio.h>

typedef float value_type;
typedef unsigned long index_type;

static value_type F(value_type v) {
  return 1.0/(1.0 + exp(-v));
}

static value_type DF(value_type v) {
  const value_type Ev = exp(-v);
  return Ev/((1.0 + Ev)*(1.0 + Ev));
}

#ifndef WITH_BLAS

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    for (index_type i=0;i<width;i++) {
      Yp[k*width + i] = 0.0;
      for (index_type j=0;j<width;j++) {
        Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
      }
      Yp[k*width + i] *= DF(Ap[k*width + i]);
    }
  }
}

#else

static void get_Yp(const value_type * __restrict__ Ap, const value_type * __restrict__ W,
           value_type * __restrict__ Yp, const value_type * __restrict__ Dp,
           const index_type height, const index_type width) {
  for (index_type i=0;i<width;i++) {
    Yp[height*width + i] = 2*DF(Ap[height*width + i])*(Dp[i] - F(Ap[height*width + i]));
  }

  for (index_type k=height-1;k+1>0;k--) {
    cblas_sgemv(CblasRowMajor, CblasTrans, width, width, 1.0f,
                W+k*width*width, width, Yp+(k+1)*width, 1, 0.0f, Yp+k*width, 1);
    for (index_type i=0;i<width;i++)
      Yp[k*width + i] *= DF(Ap[k*width + i]);
  }
}

#endif

int main() {
  const index_type height=10, width=10000;

  value_type *Ap=malloc((height+1)*width*sizeof(value_type)),
    *W=malloc(height*width*width*sizeof(value_type)),
    *Yp=malloc((height+1)*width*sizeof(value_type)),
    *Dp=malloc(width*sizeof(value_type));

  get_Yp(Ap, W, Yp, Dp, height, width);
  printf("Done %f\n", Yp[3]);

  return 0;
}
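(A side note: the buffers above are never initialized, so reading them is undefined and can let the compiler discard work when benchmarking, cf. the "optimizing out code" edit above. A minimal sketch of filling them with dummy values before calling get_Yp:)

/* Sketch: give the inputs defined values so the computation
   cannot be discarded when benchmarking. */
for (index_type n=0;n<(height+1)*width;n++) Ap[n] = 0.5f;
for (index_type n=0;n<height*width*width;n++) W[n] = 0.001f;
for (index_type n=0;n<width;n++) Dp[n] = 1.0f;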
  • Try creating a Minimal, Complete and Verifiable Example that reproduces the problem. – Ross Ridge Jan 18 '16 at 18:01
  • I'm not sure [vectorization](http://stackoverflow.com/tags/vectorization/info) is the right tag. – Z boson Jan 18 '16 at 21:55
  • Try using `sum` instead of `Yp[k*width + i]` in the reduction, and then after the reduction doing `Yp[k*width + i] = sum`. But you really should provide a minimal example, as @RossRidge said. Also, keep in mind that GCC does not unroll with reductions anyway. ICC unrolls twice, and Clang four times. Four times should be good across many Intel processors (since Haswell you may need to unroll even more than four times, but that's more difficult to achieve), so for auto-vectorization of a reduction, Clang is currently best. – Z boson Jan 18 '16 at 21:59
  • You're a student in particle physics at UCSD. I know a few of the profs in particle physics there well (at least I have talked with them several times). Are you going to provide a minimal working example? – Z boson Jan 20 '16 at 14:54
  • The minimal example was edited into the post yesterday. Can you not get to it? I put a link to it rather than putting all the code. Perhaps I should just put the code into the question. Who do you know? I work under Frank in the experimental CMS group. –  Jan 20 '16 at 15:21
  • I may know Frank. Is he in charge of the Tier2 (or Tier3) grid at UCSD? If so then I met him at UCSD once and probably at CERN. I know Avi and Vivek. I worked on CMS for four years at CERN. – Z boson Jan 20 '16 at 16:50
  • Yeah. He runs Tier2. Avi and Vivek are here too. –  Jan 20 '16 at 19:03
  • Have you been to CERN yet? What's this code for? – Z boson Jan 20 '16 at 20:46
  • No, I haven't been yet. It's definitely on my to-do list. The code is for a neural network. I am trying to learn how to write high-performance code and profile it to find potential improvements. I have done a good amount (for my age) of machine-learning work, so this seemed like a good practice run. –  Jan 20 '16 at 23:08
  • I tested a sample of your code. It did not vectorize when (as you observe) using `Yp[k*width + i] += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];`. However, it does vectorize using `sum += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];`. – Z boson Jan 21 '16 at 09:08
  • `W[k*width*width + j*width + i]` is inefficient over `j` because each iteration has to jump by `width`, and x86 SIMD gather is not efficient. – Z boson Jan 21 '16 at 09:31

3 Answers

0
  1. The j-loop is a well-vectorizable SIMD reduction loop with a constant stride of width elements. You can vectorize it using modern compilers: it is vectorizable with the Intel Compiler and should be vectorizable by GCC under some conditions.

  2. First of all, a reduction is a special case of a "vectorizable" true loop-carried dependency. So you can't safely vectorize it unless the reduction pattern is (a) auto-recognized by the compiler (not so easy, and strictly speaking not valid/expected behavior) or (b) communicated by the developer to the compiler explicitly, using OpenMP or similar standards.

To "communicate" to compiler that there is a reduction - you need to use #pragma omp simd reduction (+ : variable_name) before j-loop.

This is only supported starting from OpenMP 4.0, so you have to use a GCC version that supports OpenMP 4.x. Quote from https://gcc.gnu.org/wiki/openmp: "GCC 4.9 supports OpenMP 4.0 for C/C++, GCC 4.9.1 also for Fortran".

I would also use a temporary local variable for accumulating the reduction (OpenMP 4.0 requires the reduction variable to be used that way):

tmpSUM = 0;
#pragma omp simd reduction (+: tmpSUM)
for (index_type j=0;j<width;j++) {
  tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = tmpSUM;
  3. I'd also suggest using a signed int instead of unsigned, because unsigned induction variables are pretty bad for all modern vectorizers, at least by introducing extra overhead. I would not be surprised if using unsigned was one of the main things that "confused" GCC.

  4. Now, you may not be satisfied with my reply, because it describes how it should work (and how it works in ICC/ICPC). It doesn't take into account the specific nuances of GCC (which for reductions seemed to do the opposite), as one can see in the GCC optimization report.

So, if you are still limited to GCC, I would suggest:

  • Make sure it's a fresh enough GCC (4.9 or higher).
  • Use a signed induction variable, and still try omp simd reduction on a temporary local tmpSUM, because it should enable more advanced vectorization techniques anyway (see the sketch after this list).
  • If none of the above helps, take a look at "weird" things like those described here (very similar to your case): What do gcc's auto-vectorization messages mean?, or consider using another compiler/compiler version.
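Putting the first two suggestions together, a sketch of the inner loop with a signed induction variable. Note that GCC only honors the pragma when compiled with -fopenmp (or -fopenmp-simd, which enables just the simd pragmas without the OpenMP runtime):

/* Sketch: compile with  gcc -O3 -ffast-math -fopenmp-simd ...
   'j' is signed here, per the suggestion above. */
value_type tmpSUM = 0;
#pragma omp simd reduction(+:tmpSUM)
for (long j=0; j<(long)width; j++) {
  tmpSUM += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = tmpSUM*DF(Ap[k*width + i]);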

  5. Last comment: is the access pattern in your code, and constant stride more generally, really so bad? Answer: on some platforms/compilers a constant stride will not kill your performance. But still, ideally you need a more cache-friendly algorithm (see the sketch below). Check Not sure how to explain some of the performance results of my parallelized matrix multiplication code. One more option is considering MKL or BLAS (as suggested by the other reply) if your code is really memory-bound and you don't have time to deal with memory optimizations yourself.
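As an illustration of what "more cache-friendly" could mean here: if each width x width layer of W were stored transposed (Wt below is a hypothetical copy with the j and i roles swapped), the j-loop would read contiguous memory instead of jumping by width each iteration:

/* Hypothetical layout: Wt[k*width*width + i*width + j] holds
   W[k*width*width + j*width + i], so the inner loop is unit-stride. */
value_type sum = 0;
for (index_type j=0;j<width;j++) {
  sum += Wt[k*width*width + i*width + j]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = sum*DF(Ap[k*width + i]);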
zam
  • I am confident I could get GCC to vectorize this without using `omp simd` if the OP just provided a working example. Of course I can't say for sure until I see the code but currently it does not look so difficult to me. – Z boson Jan 20 '16 at 14:52
  • @Zboson. I agree with you, and this is my expectation too; you should be 100 times more experienced with GCC than me. In my reply I wanted to focus on a "formal"/generalized answer, particularly bearing in mind that a reduction (as a proven dependence in the given case) is formally illegal to auto-vectorize (it depends on which programming-model philosophy you follow; old Cray guys would not agree). – zam Jan 20 '16 at 15:38
  • The OP used `-ffast-math` which allows the floating point reduction. Unfortunately GCC does not unroll the loop. But neither does `omp simd`. If OpenMP simd had unroll it would be useful. – Z boson Jan 20 '16 at 16:46
  • Don't conflate the fp model preventing vectorization with a proven dependency formally preventing auto-vectorization of a reduction regardless of the fp model. In the Intel compiler you can combine the unroll and omp simd pragmas, although it's true that OMP 4.x lacks many important simd programming-model capabilities – zam Jan 20 '16 at 20:07
  • I know what you mean. I have not tried the code yet. Do you see an obvious dependency? – Z boson Jan 20 '16 at 20:33
  • I really don't see why signed vs unsigned would make a difference. It did not make a difference in my tests. In fact, I think it's better to use unsigned. Best of all is `size_t`, which is unsigned anyway. – Z boson Jan 21 '16 at 10:19
  • When using unsigned there is a need for an extra check for max_int overflow – zam Jan 21 '16 at 15:16
  • @Z_boson, on the dependency: yes, I see a dependence. A reduction is always a WAW (or RAW, depending on your taxonomy view) dependence. It's just that a reduction is an example of a dependence (along with things like the compress/expand idiom) that doesn't prevent SIMD parallelism in modern compilers. – zam Jan 21 '16 at 15:18
  • Look, you don't need `#pragma omp simd reduction` to do the SIMD reduction, nor do you need signed integers. You just need a local temporary variable (like I said in one of my first comments on the OP's question) and associative math. – Z boson Jan 21 '16 at 15:46
  • I described what you need with the Intel Compiler AND GCC, in case you want to strictly follow the OpenMP standard. I didn't originally look at your comment to the OP, but yes, it looks like it helped in the given case. – zam Jan 21 '16 at 19:14
  • I need/want to learn more about the subject of high-performance coding. I still feel like I am guessing a lot when I try to implement optimal code. Is this more of a book-learnt subject, like physics, or trial and error? If book-learnt, could you recommend books/websites/journals? I am currently reading "Introduction to High Performance Computing for Scientists and Engineers" by Horst Simon –  Jan 21 '16 at 19:22
  • IMO HPC topics (including SIMD vectorization) should be started from a book-learnt approach. There are many good books on HPC in general, or on topics like multi-threading or memory optimization in particular. However, for SIMD vectorization there are not many state-of-the-art book-style resources. Overall, the HPC industry is moving faster than the book writers. I will post a few book/article links on SIMD in the next comment – zam Jan 21 '16 at 19:32
  • 1. Aart Bik (ex-Intel) wrote a brilliant book 15 years ago: http://www.amazon.com/Software-Vectorization-Handbook-The-Performance/dp/0974364924 . Unfortunately it's enormously obsolete these days. – zam Jan 21 '16 at 19:33
  • 2. Another pretty obsolete, but well done SHORT reference/article: https://d3f8ykwhia686p.cloudfront.net/1live/intel/CompilerAutovectorizationGuide.pdf Remember that it's Intel specific and don't forget that 50% of syntax (especially for compilation options) has changed in newer compilers/standards – zam Jan 21 '16 at 19:35
  • 3. One example of good article/chapter explaining OpenMP4 basics is here: http://rd.springer.com/chapter/10.1007%2F978-3-642-30961-8_5 (click Download Chapter) – zam Jan 21 '16 at 19:40
  • 4. Look at the three recent James Reinders-edited books, like this one: https://www.safaribooksonline.com/library/view/high-performance-parallelism/9780128038901/ . I don't recommend reading all the content there, and you should understand that most of them describe potentially controversial ways to address the same problem by different customers/engineers. I.e. they are more collections of use-case references than books – zam Jan 21 '16 at 19:41
  • 5. Here is one of most recent attempt to provide good reference to all vectorization resources published by intel in last 3-5 years: https://software.intel.com/en-us/articles/vectorization-resources-advisor Don't be confused that it starts from Vector Advisor part; later in the page you could find big reference of more general links – zam Jan 21 '16 at 19:43
  • Most links provided by me are more or less related to Intel. However most of content in items (1), (3) and partially (4) should be equally applicable to GCC/non-intel x86 world. – zam Jan 21 '16 at 19:46
  • The reason there has been no gold-standard vectorization book in the last 15 years is simply that vectorization was originally a highlight of the 1980s-1990s, while in the 2000s the SIMD topic became unpopular and was not aligned with the majority of modern hardware trends. The situation fundamentally changed about 5-7 years ago, when Intel and AMD (and actually ARM) started big investments in SIMD hardware again. Many things have changed in the course of these 5-7 years, so as I said, the book writers were not up to speed. Although I know about one good book to be published in June '16. Ping me by then :-) – zam Jan 21 '16 at 19:48
  • I looked into the signed vs. unsigned issue a bit, though not within the context of vectorization and I think it's debatable. I'll have to reconsider using unsigned for iterators. I don't know the right answer yet but I usually look at the assembly anyway so at least for the cases I test I know what is good and bad. – Z boson Jan 22 '16 at 08:57
  • Can you point me to a link about the max_int overflow check using unsigned? I would like to read about that. – Z boson Jan 22 '16 at 08:58
  • @Z_boson. I don't know of a link and I don't think documentation exists. The only way to check is to compare icc/ifort assembly. But I can try to explain; there are 2 reasons why unsigned is "bad": (1) according to the language standard, a[i+10] is not unconditionally sequential. It's logically possible that the access is not sequential by design, so the compiler has to support the case when e.g. half of a vector iteration is for i = 254, 255, while the second half is for i = 0, 1. (2) Loops like "for (unsigned i = A; i != B; i++)" cannot be transformed to canonical form – zam Jan 27 '16 at 20:35
0

I tested the following code with GCC 5.3.1, Clang 3.7.1, ICC 13.0.1, and MSVC 2015:

void foo(float *x) {
    unsigned i;
    for (i=0; i<1024; i++) x[0] += x[1024 + i];
}

I used -Ofast for GCC, Clang, and ICC and /O2 /fp:fast with MSVC. Looking at the assembly shows that only ICC managed to vectorize the loop.

However, with the same compile options, all the compilers vectorized the following code:

void foo2(float *x) {
    float sum = 0.0f;
    unsigned i;
    for (i=0; i<1024; i++) sum += x[1024 + i];
    x[0] = sum;
}

I'm not sure why only ICC vectorized foo. It seems clear to me that there is no dependency.

GCC does not unroll the loop (and -funroll-loops does not help). MSVC unrolls twice, Clang unrolls four times, and ICC eight times. Intel processors since at least Core2 have a latency of at least 3 cycles for floating-point addition, so unrolling four or more times is going to work better than twice or not at all.
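Since GCC will not unroll the reduction for you, it can be done by hand with several independent accumulators to hide the floating-point add latency. A sketch (foo3 is just an illustrative name; it assumes the trip count is a multiple of 4):

void foo3(float *x) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    unsigned i;
    /* Four independent accumulators break the serial chain of adds. */
    for (i=0; i<1024; i+=4) {
        s0 += x[1024 + i];
        s1 += x[1024 + i + 1];
        s2 += x[1024 + i + 2];
        s3 += x[1024 + i + 3];
    }
    x[0] = (s0 + s1) + (s2 + s3);
}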

In any case using

value_type sum = 0.0;
for (index_type j=0;j<width;j++) {
    sum += W[k*width*width + j*width + i]*Yp[(k+1)*width + j];
}
Yp[k*width + i] = sum*DF(Ap[k*width + i]);

vectorizes your loop.

Z boson
-1

In my experience, getting GCC to vectorize properly is troublesome, especially if you hope to fully utilize modern hardware (e.g. AVX2). I deal a lot with vectorization in my code. My solution: do not attempt to use GCC for vectorization at all. Rather, formulate all the operations you'd like to be vectorized in terms of BLAS (Basic Linear Algebra Subprograms) calls, and link with the OpenBLAS library, which implements BLAS for modern hardware. OpenBLAS also offers parallelization on shared-memory nodes (either with pthreads or OpenMP), which, depending on your application, can be very important for further increasing execution speed: in this way you can combine vectorization with (shared-memory) parallelization.

Furthermore, I suggest that you align your memory to fully utilize AVX and/or AVX2: i.e. do not allocate memory with malloc or new; use memalign or aligned_alloc (depending on what your system supports). For example, for AVX/AVX2 the address should be a multiple of 32 bytes (4 doubles, the width of one 256-bit register); aligning to 64 bytes additionally matches the cache-line size.
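A minimal sketch of such an allocation with C11 aligned_alloc, which requires the size to be a multiple of the alignment (alloc_aligned is just an illustrative wrapper):

#include <stdlib.h>

/* Sketch: allocate 'count' floats aligned for 256-bit AVX loads. */
static float *alloc_aligned(size_t count) {
    size_t bytes = count*sizeof(float);
    bytes = (bytes + 31) & ~(size_t)31;   /* round up to 32 bytes */
    return aligned_alloc(32, bytes);
}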

sperber
  • Thank you. Can you refer me to further information? –  Jan 18 '16 at 18:05
  • Sure, it depends on what you need to know. In principle it is easy to utilize OpenBLAS, but there are a few potential pitfalls (especially if you intend to use it for computations on a cluster). The main source of information is the GitHub page of OpenBLAS. Apart from MKL, OpenBLAS is probably the fastest and best developed implementation of the BLAS interface, but in contrast to MKL is freely available. So there is quite a community out there that can help with problems... – sperber Jan 18 '16 at 18:12
  • I like GCC for auto-vectorization and when I don't like it I use intrinsics (okay I use intrinsics most of the time for SIMD anyway) so I'm not a big fan of your answer. – Z boson Jan 18 '16 at 22:08
  • @Zboson: I didn't say you cannot do it differently. But you have to understand: not everyone (especially in science applications) has the TIME to delve so deeply into the workings of the compiler (here, in the context of vectorization) to arrive at satisfactory results - especially if you have to deal with a dozen other aspects of a project that do not involve computers or programming, and if a lot of portable code has to be produced on a regular basis. This is a solution that gets the most out of the machine (and probably future machines!) without a big investment. – sperber Jan 18 '16 at 23:38
  • BTW, I did not down vote you. Mostly because it's something I rarely do. – Z boson Jan 20 '16 at 14:50