There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case, GCC 4.7.2 successfully vectorises the following simple loop:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i] * d;
By contrast, GCC 4.6.1 does not, and it complains that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:
struct omp_fn_0_s
{
    int N;
    double *a;
    double *b;
    double *c;
    double d;
};
void omp_fn_0(struct omp_fn_0_s *data)
{
    int start, end;
    int nthreads = omp_get_num_threads();
    int threadid = omp_get_thread_num();
    // This is just to illustrate the case; GCC uses slightly different formulas
    start = (data->N * threadid) / nthreads;
    end = (data->N * (threadid+1)) / nthreads;
    for (int i = start; i < end; i++)
        data->a[i] = data->b[i] + data->c[i] * data->d;
}
...
struct omp_fn_0_s omp_data_o;
omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;
GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();
N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:
struct fun_s
{
    double *restrict a;
    double *restrict b;
    double *restrict c;
    double d;
    int n;
};
void fun1(double *restrict a,
          double *restrict b,
          double *restrict c,
          double d,
          int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}
void fun2(struct fun_s *par)
{
    int i;
    for (i = 0; i < par->n; i++)
        par->a[i] = par->b[i] + par->c[i] * par->d;
}
One would expect both functions (notice: no OpenMP here!) to vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7: it successfully vectorises the loop in fun1 but fails to vectorise the one in fun2, citing the same reason as when it compiles the OpenMP code.
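To see which of the two loops is reported as vectorised, the test can be compiled with the vectoriser's report enabled. The sketch below repeats fun1 and fun2 and adds a helper, same_result, which is hypothetical and not part of the original test; it only confirms that both functions compute the same values. The file name test.c and the exact verbosity level are assumptions.

```c
/* Possible ways to request a vectoriser report (file name is an assumption):
 *   gcc -std=c99 -O3 -ftree-vectorizer-verbose=2 -c test.c   (GCC 4.x series)
 *   gcc -std=c99 -O3 -fopt-info-vec-missed -c test.c         (more recent GCC)
 * Adding -m32 requests 32-bit x86 code.
 */
struct fun_s
{
    double *restrict a;
    double *restrict b;
    double *restrict c;
    double d;
    int n;
};

void fun1(double *restrict a, double *restrict b, double *restrict c,
          double d, int n)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}

void fun2(struct fun_s *par)
{
    int i;
    for (i = 0; i < par->n; i++)
        par->a[i] = par->b[i] + par->c[i] * par->d;
}

/* Hypothetical helper: runs both variants on the same input and
 * returns 1 if every output element matches, 0 otherwise. */
int same_result(void)
{
    enum { N = 64 };
    double b[N], c[N], a1[N], a2[N];
    struct fun_s par;
    int i;

    for (i = 0; i < N; i++) {
        b[i] = i;
        c[i] = 2.0 * i;
        a1[i] = a2[i] = 0.0;
    }

    fun1(a1, b, c, 3.0, N);

    par.a = a2;
    par.b = b;
    par.c = c;
    par.d = 3.0;
    par.n = N;
    fun2(&par);

    for (i = 0; i < N; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

The helper says nothing about vectorisation by itself; that information comes only from the compiler's report for each loop.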
The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. This is not always the case with fun1, where two cases are possible:
- d is passed as a value argument in a register;
- d is passed as a value argument on the stack.
On x64 systems the System V ABI mandates that the first several floating-point arguments are passed in the XMM registers (YMM on AVX-enabled CPUs). That is how d is passed in this case, and hence no pointer can ever point to it, so the loop gets vectorised. On 32-bit x86 systems the ABI mandates that arguments are passed on the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.
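The register case suggests something one could try for the struct-based version: copy the fields into local variables before the loop, so that the scalar lives in a local whose address is never taken. The function fun3 below is a hypothetical variant of fun2, not part of the original test, and I have not checked whether GCC 4.6.1 vectorises it; the vectoriser report would have to confirm that. It also assumes the caller guarantees that the output array does not overlap the struct itself, since the fields are read only once.

```c
struct fun_s
{
    double *restrict a;
    double *restrict b;
    double *restrict c;
    double d;
    int n;
};

/* Hypothetical variant of fun2: all fields are read once, up front.
 * Assumption: par->a does not point into *par, otherwise the result
 * would differ from fun2, which re-reads the fields on every iteration. */
void fun3(struct fun_s *par)
{
    double *restrict a = par->a;
    double *restrict b = par->b;
    double *restrict c = par->c;
    const double d = par->d;   /* local copy; no pointer can refer to it */
    const int n = par->n;
    int i;

    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i] * d;
}
```

This only applies to hand-written code such as fun2; the function that GCC outlines for an OpenMP parallel region is generated by the compiler and cannot be edited this way.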
GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d gets aliased.
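Conceptually, such a run-time check amounts to loop versioning: test for overlap once and pick either a fast loop or the original conservative loop. The sketch below illustrates the idea in plain C and is not the code GCC actually generates. The names par_s, overlaps_struct and fun2_versioned are hypothetical, the struct deliberately omits restrict, and only the overlap between the output range and the struct is checked.

```c
#include <stdint.h>

/* Same layout as fun_s, but without restrict (illustration only). */
struct par_s
{
    double *a;
    double *b;
    double *c;
    double d;
    int n;
};

/* Returns 1 if the n doubles starting at a overlap the bytes of *par. */
static int overlaps_struct(const double *a, int n, const struct par_s *par)
{
    uintptr_t lo = (uintptr_t)a;
    uintptr_t hi = lo + (uintptr_t)n * sizeof(double);
    uintptr_t s_lo = (uintptr_t)par;
    uintptr_t s_hi = s_lo + sizeof *par;
    return lo < s_hi && s_lo < hi;
}

/* Returns 1 if the fast version of the loop ran, 0 if the fallback ran. */
int fun2_versioned(struct par_s *par)
{
    int i;

    if (par->n > 0 && !overlaps_struct(par->a, par->n, par)) {
        /* Stores through a cannot modify the struct, so its fields can be
         * read once; this is the version a vectoriser could work on. */
        double *a = par->a;
        const double *b = par->b;
        const double *c = par->c;
        const double d = par->d;
        const int n = par->n;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i] * d;
        return 1;
    }

    /* Fallback: the original loop, re-reading every field each iteration. */
    for (i = 0; i < par->n; i++)
        par->a[i] = par->b[i] + par->c[i] * par->d;
    return 0;
}
```

A real vectoriser would also need to rule out overlap between a and the b and c arrays (here the source relies on restrict for that); the sketch leaves that out because its fast path is still an ordinary scalar loop.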
Getting rid of d removes the unprovable non-aliasing and the following OpenMP code gets vectorised by GCC 4.6.1:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i];