I want to ensure that gcc knows:
- The pointers refer to non-overlapping chunks of memory
- The pointers have 32 byte alignments
Is the following the correct?
template<typename T, typename T2>
void f(const T* __restrict__ __attribute__((aligned(32))) x,
T2* __restrict__ __attribute__((aligned(32))) out) {}
Thanks.
Update:
I try to use one read and lots of write to saturate the cpu ports for writing. I hope that would make the performance gain by aligned moves more significant.
But the assembly still uses unaligned moves instead of aligned moves.
Code (also at godbolt.org)
int square(const float* __restrict__ __attribute__((aligned(32))) x,
const int size,
float* __restrict__ __attribute__((aligned(32))) out0,
float* __restrict__ __attribute__((aligned(32))) out1,
float* __restrict__ __attribute__((aligned(32))) out2,
float* __restrict__ __attribute__((aligned(32))) out3,
float* __restrict__ __attribute__((aligned(32))) out4) {
for (int i = 0; i < size; ++i) {
out0[i] = x[i];
out1[i] = x[i] * x[i];
out2[i] = x[i] * x[i] * x[i];
out3[i] = x[i] * x[i] * x[i] * x[i];
out4[i] = x[i] * x[i] * x[i] * x[i] * x[i];
}
}
Assembly compiled with gcc 8.2 and "-march=haswell -O3" It is full of vmovups, which are unaligned moves.
.L3:
vmovups ymm1, YMMWORD PTR [rbx+rax]
vmulps ymm0, ymm1, ymm1
vmovups YMMWORD PTR [r14+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r15+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [r12+rax], ymm0
vmulps ymm0, ymm1, ymm0
vmovups YMMWORD PTR [rbp+0+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
and r13d, -8
vzeroupper
Same behavior even for sandybridge:
.L3:
vmovups xmm2, XMMWORD PTR [rbx+rax]
vinsertf128 ymm1, ymm2, XMMWORD PTR [rbx+16+rax], 0x1
vmulps ymm0, ymm1, ymm1
vmovups XMMWORD PTR [r14+rax], xmm0
vextractf128 XMMWORD PTR [r14+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r13+0+rax], xmm0
vextractf128 XMMWORD PTR [r13+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [r12+rax], xmm0
vextractf128 XMMWORD PTR [r12+16+rax], ymm0, 0x1
vmulps ymm0, ymm1, ymm0
vmovups XMMWORD PTR [rbp+0+rax], xmm0
vextractf128 XMMWORD PTR [rbp+16+rax], ymm0, 0x1
add rax, 32
cmp rax, rdx
jne .L3
and r15d, -8
vzeroupper
Using addition instead of multiplication (godbolt). Still unaligned moves.