In my program (written in plain C) I have a structure that holds data prepared to be transformed by a vectorized (AVX only) radix-2 2D fast Fourier transform. The structure looks like this:
struct data {
    double complex *data;
    unsigned int width;
    unsigned int height;
    unsigned int stride;
};
Now I need to load data from memory as fast as possible. As far as I know, there are aligned and unaligned loads to ymm registers (the vmovapd and vmovupd instructions, respectively), and I would like the program to use the aligned version, as it is faster.
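To make the distinction concrete, here is a minimal sketch of the two load flavours expressed with intrinsics. It assumes GCC or Clang on x86-64; the helper names `sum4_aligned` and `sum4_unaligned` are hypothetical and not part of my program, and the `target("avx")` attribute is only there so the snippet builds without `-mavx`. `_mm256_load_pd` requires a 32-byte-aligned address and typically becomes vmovapd, while `_mm256_loadu_pd` accepts any address and typically becomes vmovupd.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: sum four doubles using an aligned load. */
__attribute__((target("avx")))
static double sum4_aligned(const double *p)
{
    /* Precondition of _mm256_load_pd: p must be 32-byte aligned. */
    assert(((uintptr_t)p % 32) == 0);
    __m256d v = _mm256_load_pd(p);
    double out[4];
    _mm256_storeu_pd(out, v);
    return out[0] + out[1] + out[2] + out[3];
}

/* Hypothetical helper: same sum, but the load tolerates any address. */
__attribute__((target("avx")))
static double sum4_unaligned(const double *p)
{
    __m256d v = _mm256_loadu_pd(p);
    double out[4];
    _mm256_storeu_pd(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

Calling `sum4_aligned` on an `alignas(32)` buffer is fine, whereas passing it `buf + 1` would violate the alignment requirement; `sum4_unaligned` works for both.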
So far I use a roughly similar construction for all operations over the array. This example is the part of the program where data and filter have both already been transformed to the frequency domain, and the filter is applied to the data by element-by-element multiplication.
union m256d {
    __m256d reg;
    double d[4];
};
struct data *data, *filter;
/* Load data and filter here, both have the same width, height and stride. */
unsigned int stride = data->stride;
for (unsigned int i = 0; i < data->height; i++) {
    for (unsigned int j = 0; j < data->width; j += 4) {
        union m256d a[2];
        union m256d b[2];
        union m256d r[2];
        memcpy(a, &(data->data[i*stride+j]), 2*sizeof(*a));
        memcpy(b, &(filter->data[i*stride+j]), 2*sizeof(*b));
        r[0].reg = _mm256_mul_pd(a[0].reg, b[0].reg);
        r[1].reg = _mm256_mul_pd(a[1].reg, b[1].reg);
        memcpy(&(data->data[i*stride+j]), r, 2*sizeof(*r));
    }
}
As expected, the memcpy calls are optimized away. However, from what I have observed, gcc translates each memcpy either into two vmovupd instructions, or into a bunch of movq instructions that copy the data to a guaranteed-aligned location on the stack, followed by two vmovapd instructions that load it into ymm registers. Which of the two I get depends on whether the memcpy prototype is declared (if it is declared, gcc uses movq and vmovapd).
I am able to ensure that the data in memory is aligned, but I am not sure how to tell gcc that it can simply use vmovapd instructions to load the data from memory straight into ymm registers. I strongly suspect that gcc does not know that the data pointed to by &(data->data[i*stride+j]) is always aligned.
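For illustration, alignment of this kind could be ensured roughly as follows. This is only a sketch, assuming C11 `aligned_alloc` and that `sizeof(double complex)` is 16; the helper names `is_aligned32`, `padded_stride` and `alloc_rows` are hypothetical and not taken from the program above. With a 32-byte-aligned base and an even stride, every element at an even index (in particular every `i*stride+j` with `j` a multiple of 4) sits on a 32-byte boundary.

```c
#include <assert.h>
#include <complex.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: true if p is aligned to 32 bytes (one ymm register). */
static int is_aligned32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Hypothetical helper: round width up to an even element count, so each
 * row occupies a multiple of 32 bytes (assuming 16-byte double complex). */
static unsigned int padded_stride(unsigned int width)
{
    return (width + 1u) & ~1u;
}

/* Hypothetical allocator: 32-byte-aligned storage for height rows of
 * stride elements each. The caller releases it with free(). */
static double complex *alloc_rows(unsigned int stride, unsigned int height)
{
    assert(stride % 2 == 0);
    size_t bytes = (size_t)stride * height * sizeof(double complex);
    /* The size is a multiple of the alignment because stride is even. */
    return aligned_alloc(32, bytes);
}
```

A buffer obtained from `alloc_rows(padded_stride(width), height)` would then satisfy the alignment I describe, although gcc still has no way of knowing that from the pointer type alone.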
Is there a way to tell gcc that the data pointed to by a pointer will always be aligned?