You're using the vmovdqa instruction, which requires an aligned memory operand, on an unaligned element of the array. Use vmovdqu instead, for both loads and stores. Or better yet, use memory operands in the actual computation instructions (this is only valid in AVX though; in legacy SSE, memory operands of most instructions must be aligned).
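For example (just a sketch in AT&T syntax; (%rax) stands in for whatever address the element ends up at):

    vmovdqa   (%rax), %xmm0      # faults if the address is not 16-byte aligned
    vmovdqu   (%rax), %xmm0      # works for any alignment
    vcvtdq2ps (%rax), %xmm0      # with AVX, an unaligned memory operand can be folded into the computation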
There are other inefficiencies and problems with the assembler block. For instance, as mentioned in the comments, you're missing the clobbers, which indicate the pieces of CPU and memory state that may be modified by the asm block. In your case, you're missing the "memory", "xmm0" and "xmm1" clobbers. Without these, the compiler will assume that the asm block does not affect memory contents (the aResult array in particular) or the xmm registers (and, for example, use those registers for its own purposes, in conflict with your asm block).
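As an illustration, a minimal sketch (with hypothetical src and dst pointers, not your code) of an asm statement that passes plain pointers and therefore needs both the register and the "memory" clobbers:

    /* src and dst are hypothetical int pointers, shown only to illustrate the clobber list */
    __asm__ ("vmovdqu (%0), %%xmm0;"
             "vpaddd  %%xmm0, %%xmm0, %%xmm0;"
             "vmovdqu %%xmm0, (%1);"
             : /* no outputs */
             : "r"(src), "r"(dst)
             : "memory", "xmm0");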
Also, you seem to have messed up the input and output registers in the addps and divps instructions, as you are overwriting or not using the results of the previous instructions in a few instances. In the AT&T x86 asm syntax used by gcc, the last operand is the output operand. You should normally use the AVX version of every instruction when using any AVX instructions; mixing 128-bit AVX instructions with legacy SSE won't cause SSE/AVX transition stalls if the upper halves of the YMM registers were already clean (e.g. after vzeroupper), so using vaddps / vdivps here is optional but recommended.
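To illustrate the operand order (AT&T syntax):

    addps  %xmm1, %xmm0          # SSE: xmm0 = xmm0 + xmm1  (destination is the last operand)
    vaddps %xmm1, %xmm0, %xmm2   # AVX: xmm2 = xmm0 + xmm1  (three operands, non-destructive)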
Also, you're passing references to the input and output arrays inefficiently. Instead of passing pointers to the specific elements of the arrays, it is more efficient to pass memory references to those elements, which allows the compiler to use more elaborate memory addressing expressions than a plain pointer. This also removes the need to calculate the pointer in a separate instruction before your asm block. And it is more efficient to pass the three constant in an xmm register instead of memory. Ideally, you would want to move the vbroadcastss out of the loop, but that is only possible with intrinsics (or by writing the whole loop inside one asm statement).
The corrected and improved asm statement would look like this:
__asm__ __volatile__
(
    "vcvtdq2ps %1, %%xmm0;"           // xmm0 = (float)cyclic[i..i+3]
    "vcvtdq2ps %2, %%xmm1;"           // xmm1 = (float)cyclic[j..j+3]
    "vaddps %%xmm1, %%xmm0, %%xmm0;"  // xmm0 += xmm1
    "vcvtdq2ps %3, %%xmm1;"           // xmm1 = (float)cyclic[k..k+3]
    "vaddps %%xmm1, %%xmm0, %%xmm0;"  // xmm0 += xmm1
    "vbroadcastss %4, %%xmm1;"        // xmm1 = {three, three, three, three}
    "vdivps %%xmm0, %%xmm1, %%xmm1;"  // xmm1 = three / sum
    "vcvtps2dq %%xmm1, %%xmm1;"       // convert back to integers
    "vmovdqu %%xmm1, %0;"             // store to aResult[i..i+3]
    : "=m"(aResult[i])
    : "m"(cyclic[i]), "m"(cyclic[j]), "m"(cyclic[k]), "x"(three)
    : "memory", "xmm0", "xmm1"
);
(It actually doesn't have to be volatile anymore, now that the memory output is an explicit operand.)
But a better solution yet would be to use intrinsics to implement this asm block. Not only would this make the code safer, it would also make it more efficient, as it would enable additional compiler optimizations. Of course, this is only possible if your compiler supports intrinsics.
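A rough sketch with intrinsics might look like this (I'm assuming, based on the asm operands above, that cyclic and aResult are arrays of 32-bit integers and three is a float; adjust for the actual types):

    #include <immintrin.h>

    /* Assumed types: int cyclic[], int aResult[], float three. */
    __m128 sum = _mm_add_ps(
        _mm_add_ps(_mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[i])),
                   _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[j]))),
        _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[k])));
    __m128 res = _mm_div_ps(_mm_set1_ps(three), sum);  /* the broadcast can be hoisted out of the loop */
    _mm_storeu_si128((__m128i *)&aResult[i], _mm_cvtps_epi32(res));

Compiled with -mavx, gcc emits the VEX-encoded forms of these instructions, and it can hoist the broadcast of three out of the loop on its own.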