
In DPDK's implementation of rte_memcpy for AVX-512, I see that the rte_mov512blocks function is used when copying data to a dst aligned to 64 bytes. However, when I look into the implementation of rte_mov512blocks, I see

```c
static inline void
rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
...
        _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
        _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
        _mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
        _mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
        _mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
        _mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
        _mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
        _mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
        dst = dst + 512;
    }
}
```

My question is: why is _mm512_storeu_si512 still used here? If dst is aligned, why not use _mm512_store_si512?

asked by calvin; edited by Peter Cordes
    I believe that on many (most) machines, the aligned load/store instructions are not actually any faster than unaligned, when the address actually is aligned. There's a brief mention at the end of [this answer](https://stackoverflow.com/a/54049733/634919), maybe more elsewhere that I didn't find. So the only reason to use the aligned instructions would be if you specifically want to fault on unaligned addresses. – Nate Eldredge Apr 02 '23 at 15:56

0 Answers