In DPDK's implementation of rte_memcpy
on AVX512, I see the rte_mov512blocks
function is used when copy data to a dst
aligned to 64 bytes.
However, when look into the implementation of rte_mov512blocks
, I see
static inline void
rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
...
_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
dst = dst + 512;
}
}
My question is why _mm512_storeu_si512
is still used? If dst
is aligned, why not use _mm512_store_si512
?