I have a large block of data to compute:
static float source0[COUNT];
static float source1[COUNT];
static float result[COUNT]; /* result[i] = source0[i] * source1[i]; */
s0 = (size_t)source0;  /* raw byte addresses; the loops below step in bytes */
s1 = (size_t)source1;
r  = (size_t)result;
They are all 32-byte aligned.
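In case it matters, the alignment is guaranteed roughly like this (the GCC/Clang attribute below is only an example of one way to do it, not necessarily how my real build does it):

/* Example only: one common GCC/Clang way to force 32-byte alignment. */
static float source0[COUNT] __attribute__((aligned(32)));
static float source1[COUNT] __attribute__((aligned(32)));
static float result[COUNT]  __attribute__((aligned(32)));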
The corresponding SSE code (Intel syntax):
for(i = 0; i < COUNT; i += 16)  /* 16 bytes = 4 floats per iteration */
{
    __asm volatile
    (
        "movntdqa xmm0, [%0]\n\t"   /* non-temporal load of 4 floats */
        "movntdqa xmm1, [%1]\n\t"
        "mulps xmm1, xmm0\n\t"      /* packed single-precision multiply */
        "movntps [%2], xmm1"        /* non-temporal store of the result */
        : : "r"(s0 + i), "r"(s1 + i), "r"(r + i) : "xmm0", "xmm1", "memory"
    );
}
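The same loop can also be written with intrinsics instead of inline asm; this is only a sketch of what the asm above does (it assumes SSE4.1 for _mm_stream_load_si128 and indexes in floats rather than bytes):

#include <immintrin.h>   /* SSE4.1 intrinsics */

for (size_t i = 0; i < COUNT; i += 4)
{
    /* movntdqa: non-temporal loads, reinterpreted as packed floats */
    __m128 a = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(source0 + i)));
    __m128 b = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(source1 + i)));
    /* mulps + movntps: multiply and store with a non-temporal hint */
    _mm_stream_ps(result + i, _mm_mul_ps(a, b));
}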
The corresponding AVX code:
for(i = 0; i < COUNT; i += 32)  /* 32 bytes = 8 floats per iteration */
{
    __asm volatile
    (
        "vmovapd ymm0, [%0]\n\t"        /* regular aligned 256-bit load */
        "vmovapd ymm1, [%1]\n\t"
        "vmulps ymm1, ymm1, ymm0\n\t"   /* packed single-precision multiply */
        "vmovntps [%2], ymm1"           /* non-temporal store of the result */
        : : "r"(s0 + i), "r"(s1 + i), "r"(r + i) : "ymm0", "ymm1", "memory"
    );
}
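Again only as a sketch, the intrinsics form of the same AVX loop (plain AVX, indexing in floats):

#include <immintrin.h>   /* AVX intrinsics */

for (size_t i = 0; i < COUNT; i += 8)
{
    __m256 a = _mm256_load_ps(source0 + i);               /* vmovaps/vmovapd: aligned load */
    __m256 b = _mm256_load_ps(source1 + i);
    _mm256_stream_ps(result + i, _mm256_mul_ps(a, b));    /* vmulps + vmovntps */
}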
The result is that the AVX code always takes nearly the same time as the SSE code, although both are much faster than plain C code. I think the major reason is that "vmovapd" has no non-temporal ("NT") counterpart for 256-bit loads; vmovntdqa on a ymm register only arrives with the AVX2 extension. This causes too much d-cache pollution.
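For comparison, the 256-bit NT-load form I cannot use on this machine would look roughly like this (sketch only; requires AVX2):

for (size_t i = 0; i < COUNT; i += 8)
{
    /* AVX2 only: 256-bit non-temporal load via vmovntdqa ymm */
    __m256i ia = _mm256_stream_load_si256((__m256i const *)(source0 + i));
    __m256i ib = _mm256_stream_load_si256((__m256i const *)(source1 + i));
    _mm256_stream_ps(result + i, _mm256_mul_ps(_mm256_castsi256_ps(ia), _mm256_castsi256_ps(ib)));
}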
Is there a better way to exploit the power of AVX (not AVX2)?