
I have been working on making my code auto-vectorisable by GCC, but when I include the -fopenmp flag it seems to stop all attempts at auto-vectorisation. I am using -ftree-vectorize -ftree-vectorizer-verbose=5 to vectorise the loops and monitor the process.

If I do not include the -fopenmp flag, it gives me a lot of information about each loop: whether it is vectorised and, if not, why not. Compilation then fails when I try to use the omp_get_wtime() function, since it can't be linked. Once the flag is included, it simply lists every function and tells me it vectorised 0 loops in it.
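Roughly, the two invocations I am comparing look like this (the file name and the -O2 level here are just placeholders, not my exact command):

# without -fopenmp
gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 prog.c -o prog

# with -fopenmp
gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=5 -fopenmp prog.c -o prog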

I've seen the issue mentioned in a few other places, but nothing there really comes to a solution: http://software.intel.com/en-us/forums/topic/295858 and http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to tell it to explicitly?

superbriggs
  • I think you can find sensible information in the [answer](http://stackoverflow.com/a/14717689/771663) to this question. – Massimiliano Feb 13 '13 at 19:52
  • Thank you, that describes how to use SIMD with OpenMP, but it doesn't seem to explain why my already-working implementation of SIMD stops working when I use OpenMP. Is there not a way to use both? – superbriggs Feb 13 '13 at 19:58
  • 1
    This also implies that I can only operate on the same number of bits, they are just split between the numbers. While doing it with GCC I was not asked how many I wanted to split on to a register. Since I am using a university 'super computer', I had assumed that the hardware ha extra spaces for SIMD. How would I find out if that is correct? – superbriggs Feb 13 '13 at 20:06
  • The hardware is an AMD processor, which will use 3Dnow! – superbriggs Feb 13 '13 at 20:13
  • Ultimately my question is: since the hardware does have specific registers that can hold more to help with vectorisation, how do I use them with GCC, given that the functions in that link split the normal-sized register into chunks? – superbriggs Feb 13 '13 at 20:28

4 Answers


There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case GCC 4.7.2 successfully vectorises the following simple loop:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i] * d;

At the same time, GCC 4.6.1 does not, complaining that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:

struct omp_fn_0_s
{
    int N;
    double *a;
    double *b;
    double *c;
    double d;
};

void omp_fn_0(struct omp_fn_0_s *data)
{
    int start, end;
    int nthreads = omp_get_num_threads();
    int threadid = omp_get_thread_num();

    // This is just to illustrate the case - GCC uses a bit different formulas
    start = (data->N * threadid) / nthreads;
    end = (data->N * (threadid+1)) / nthreads;

    for (int i = start; i < end; i++)
       data->a[i] = data->b[i] + data->c[i] * data->d;
}

...

struct omp_fn_0_s omp_data_o;

omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;

GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();

N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;

The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:

struct fun_s
{
   double *restrict a;
   double *restrict b;
   double *restrict c;
   double d;
   int n;
};

void fun1(double *restrict a,
          double *restrict b,
          double *restrict c,
          double d,
          int n)
{
   int i;
   for (i = 0; i < n; i++)
      a[i] = b[i] + c[i] * d;
}

void fun2(struct fun_s *par)
{
   int i;
   for (i = 0; i < par->n; i++)
      par->a[i] = par->b[i] + par->c[i] * par->d;
}

One would expect that both functions (notice - no OpenMP here!) should vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7 - it successfully vectorises the loop in fun1 but fails to vectorise the one in fun2, citing the same reason as when it compiles the OpenMP code.

The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. With fun1 that is not necessarily a problem, since two cases are possible:

  • d is passed as a value argument in a register;
  • d is passed as a value argument on the stack.

On x86-64 systems the System V ABI mandates that the first few floating-point arguments are passed in XMM registers. That is how d gets passed in this case, hence no pointer can ever point to it and the loop gets vectorised. On 32-bit x86 systems the ABI mandates that arguments are passed on the stack, so d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.

GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d is aliased.
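Conceptually, that run-time check amounts to something like the following. This is only an illustration of the idea - the code GCC actually emits is different, and relational comparisons between unrelated pointers like these are only meaningful for the compiler internally:

void fun2_with_check(struct fun_s *par)
{
    /* Take the vectorised path only if the address of par->d cannot be
       reached through any of the three arrays (illustration only). */
    const double *dp = &par->d;
    int d_may_alias = (dp >= par->a && dp < par->a + par->n) ||
                      (dp >= par->b && dp < par->b + par->n) ||
                      (dp >= par->c && dp < par->c + par->n);

    if (!d_may_alias) {
        for (int i = 0; i < par->n; i++)   /* vectorised version */
            par->a[i] = par->b[i] + par->c[i] * par->d;
    } else {
        for (int i = 0; i < par->n; i++)   /* scalar fallback */
            par->a[i] = par->b[i] + par->c[i] * par->d;
    }
}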

Getting rid of d removes the aliasing that cannot be disproven, and the following OpenMP code gets vectorised even by GCC 4.6.1:

#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   a[i] = b[i] + c[i];
Hristo Iliev
  • Great Answer. But could you say more about " This is just to illustrate the case - GCC uses a bit different formulas". What formula does GCC use? – Z boson Mar 23 '14 at 13:28
  • @Zboson, I could paste it here (ugly), but you'd rather run `gcc -fdump-tree-all -fopenmp foo.c` and examine for yourself the AST after the OpenMP expansion, usually located in `foo.c.015t.ompexp`. The difference is that GCC distributes the remainder from the division `r = N % num_threads` by giving one additional iteration to the first `r` threads. – Hristo Iliev Mar 23 '14 at 14:40

I'll try to briefly answer your question.

  1. Does OpenMP have its own way of handling vectorisation?

Yes, but only starting from the upcoming OpenMP 4.0. The link posted above provides good insight into this construct (see also the sketch after this list). The current OpenMP 3.1, on the other hand, is not "aware" of the SIMD concept. What happens in practice (or, at least, in my experience) is that auto-vectorization mechanisms are inhibited whenever an OpenMP worksharing construct is used on a loop. Anyhow the two concepts are orthogonal and you can still benefit from both (see this other answer).

  2. Do I need to explicitly tell it to?

I am afraid so, at least at present. I would start by rewriting the loops under consideration in a way that makes vectorization explicit (i.e. using intrinsics on Intel platforms, AltiVec on IBM, and so on).
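Just to give an idea of what the OpenMP 4.0 construct mentioned in point 1 should look like, here is a sketch based on the draft specification (the function and array names are generic, not taken from the question):

// Sketch of the OpenMP 4.0 combined construct: iterations are divided
// among threads, and each thread is asked to vectorise its own chunk.
void axpy(int n, double *restrict a,
          const double *restrict b, const double *restrict c, double d)
{
    #pragma omp parallel for simd schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}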

Massimiliano
  • Thank you very much. Your first link gives the function `VECTOR_ADD`. I have read that it uses one normal-sized register, therefore only allowing small numbers to be vectorised. I know my hardware has specific registers to handle SIMD so that this does not happen. Is there a way to make OpenMP use those registers? Do I need to use these functions, considering that before GCC did it all for me? I don't see why OpenMP stops this from working. Your second link says that they can both work together, but not how I would achieve this. Thank you very much again. – superbriggs Feb 13 '13 at 20:34
  • The main idea is that OpenMP must not be aware of SIMDization, because you take care of it in VECTOR_ADD. I never used 3DNow!, but on Intel platforms you can use [intrinsics](http://software.intel.com/en-us/articles/how-to-use-intrinsics) to explicitly vectorize the code. The main drawback is that you either lose portability (as intrinsics won't work on other platforms) or readability/maintainability (because of conditional compilation). – Massimiliano Feb 13 '13 at 20:42
  • For this project maintainability and portability are not important. I am currently not using VECTOR_ADD, I am simply putting it in a loop in a way in which GCC can see what is happening and automatically vectorise it. – superbriggs Feb 13 '13 at 21:26

You are asking, "Why can't GCC do vectorization when OpenMP is enabled?"

It seems that this may be a bug in GCC :) http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032

Otherwise, the OpenMP API may introduce dependencies (either control or data) that prevent automatic vectorization. To auto-vectorize, a given piece of code must be free of data and control dependencies. It's possible that using OpenMP introduces some spurious dependencies.
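For example (a generic illustration, not code from the question):

void dependency_examples(int n, double *restrict a, const double *restrict b)
{
    // Independent iterations: each a[i] depends only on b[i],
    // so this loop is a candidate for auto-vectorization.
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0;

    // Loop-carried data dependency: a[i] needs the a[i-1] written by the
    // previous iteration, so this loop cannot be vectorized as-is.
    for (int i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}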

Note: OpenMP (prior to 4.0) provides thread-level parallelism, which is orthogonal to SIMD/vectorization. A program can use both OpenMP and SIMD parallelism at the same time.

minjang

I ran across this post while searching for comments about the gcc 4.9 option -fopenmp-simd, which should activate the OpenMP 4 #pragma omp simd without activating omp parallel (threading). gcc bugzilla pr60117 (confirmed) shows a case where the omp pragma prevents auto-vectorization that occurred without the pragma.

gcc doesn't vectorize an omp parallel for loop even with the simd clause (in parallel regions it can auto-vectorize only an inner loop nested under the parallel for). I don't know of any compiler other than icc 14.0.2 that could be recommended for #pragma omp parallel for simd; with other compilers, SSE intrinsics coding would be required to get this effect.
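The pattern that does work reasonably well is to let the threads split an outer loop and leave a plain inner loop for the auto-vectorizer. A generic sketch (array and bound names are mine, not from the question):

// Outer loop is distributed across threads by OpenMP; each thread's inner
// loop is a contiguous, dependence-free loop that the auto-vectorizer can
// handle on its own.
void add2d(int rows, int cols, double *restrict a,
           const double *restrict b, const double *restrict c)
{
    #pragma omp parallel for
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            a[i * cols + j] = b[i * cols + j] + c[i * cols + j];
}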

The Microsoft compiler doesn't perform any auto-vectorization inside parallel regions in my tests, which show clear superiority of gcc for such cases.

Combined parallelization and vectorization of a single loop has several difficulties, even with the best implementation. I seldom see more than 2x or 3x speedup from adding vectorization to a parallel loop. Vectorization with the AVX double data type, for example, effectively cuts the chunk size by a factor of 4. A typical implementation can achieve aligned data chunks only when the entire array is aligned and the chunks are also exact multiples of the vector width. When the chunks are not all aligned, there is inherent work imbalance due to the varying alignments.

tim18