If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that memcpy is written in assembler in glibc and is not optimized for SSE2/SSE3, and other generic memcpy implementations may not fully exploit the system's capabilities with respect to data alignment, size, and so on.
Here is my current memcpy, which takes data alignment into consideration and is optimized for SSE2 (I think) but not for SSE3:
#ifdef __SSE2__
// SSE2 optimized memcpy()
void *CMemUtils::MemCpy(void *__restrict b, const void *__restrict a, size_t n)
{
    char *s1 = (char*)b;
    const char *s2 = (const char*)a;
    for (; 0 < n; --n) *s1++ = *s2++;
    return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
    // Use system memcpy()
    return memcpy(dest, source, count);
#else
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    const _UINT64 *sourcePtr8 = (const _UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    // Copy 32-bit blocks
    const _UINT32 *sourcePtr4 = (const _UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    // Copy 16-bit blocks
    const _UINT16 *sourcePtr2 = (const _UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];
    if (!bytesLeft) return dest;

    // Copy remaining bytes
    const _UINT8 *sourcePtr1 = (const _UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
#endif
}
#endif
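For reference, this is roughly what I picture an explicitly SSE2 copy looking like with intrinsics, instead of relying on the compiler's auto-vectorization (a minimal, untested sketch; the MemCpySSE2 name and the 16-byte loop are my own illustration, not production code):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Illustrative only: copy 16 bytes per iteration with unaligned SSE2
// loads/stores, then finish the tail byte by byte.
void *MemCpySSE2(void *dest, const void *source, size_t count)
{
    char *d = (char*)dest;
    const char *s = (const char*)source;
    for (; count >= 16; count -= 16, d += 16, s += 16)
        _mm_storeu_si128((__m128i*)d, _mm_loadu_si128((const __m128i*)s));
    while (count--) *d++ = *s++;
    return dest;
}

For very large buffers I assume non-temporal stores (_mm_stream_si128) could avoid polluting the cache, but they require a 16-byte-aligned destination, which ties into the alignment question further down.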
Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All this leads me to conclude that I should at least try to make a thread-safe, OS-portable memcpy that is optimized for SSE2/SSE3 where available.
I've also read that GCC supports aggressive unrolling with the -funroll-loops compiler option. Could this improve performance with SSE2 and/or SSE3 if there are no significant cache misses?
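As an aside, if I understand the GCC documentation correctly, unrolling can also be requested per function via the optimize attribute rather than globally with -funroll-loops (GCC-specific, and I have not measured whether it actually helps the SSE2 path):

#include <stddef.h>

// Hypothetical per-function unrolling hint (GCC attribute syntax).
__attribute__((optimize("unroll-loops")))
void CopyBytesUnrolled(char *dest, const char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dest[i] = src[i];
}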
Is there a performance gain in making different memcpy versions for 32- and 64-bit architectures?
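If it matters, the way I could imagine doing this (a hypothetical, unbenchmarked sketch) is to pick the copy word to match the pointer width, so each build moves its native word size:

#include <stdint.h>
#include <stddef.h>

// Hypothetical: choose the block type per architecture at compile time,
// so a 32-bit build moves 4 bytes and a 64-bit build 8 bytes per iteration.
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef uint64_t copy_word_t;
#else
typedef uint32_t copy_word_t;
#endif

static void CopyBulk(copy_word_t *dest, const copy_word_t *src, size_t words)
{
    for (size_t i = 0; i < words; ++i) dest[i] = src[i];
}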
Is there any performance gain in pre-aligning internal memory buffers before copying?
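What I have in mind is roughly the following (a rough, untested sketch of an alignment-aware SSE2 copy; the head/bulk/tail split and the MemCpyAligned name are my own illustration):

#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <stddef.h>

void *MemCpyAligned(void *dest, const void *source, size_t n)
{
    uint8_t *d = (uint8_t*)dest;
    const uint8_t *s = (const uint8_t*)source;

    // Head: copy single bytes until the destination is 16-byte aligned.
    while (n && ((uintptr_t)d & 15)) { *d++ = *s++; --n; }

    // Bulk: aligned stores; loads stay unaligned because the source
    // alignment is still unknown after shifting the destination.
    for (; n >= 16; n -= 16, d += 16, s += 16)
        _mm_store_si128((__m128i*)d, _mm_loadu_si128((const __m128i*)s));

    // Tail: whatever is left.
    while (n--) *d++ = *s++;
    return dest;
}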
How do I use #pragma loop to control how loop code is treated by the SSE2/SSE3 auto-parallelizer? Supposedly one could use #pragma loop where contiguous data regions are moved by a for() loop.
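For what it's worth, my impression (please correct me if this is wrong) is that neither GCC nor ICC accepts #pragma loop as such; the nearest equivalents appear to be #pragma GCC ivdep on newer GCC and #pragma ivdep / #pragma vector always on ICC, which tell the vectorizer to ignore assumed dependencies in the loop that follows:

#include <stdint.h>
#include <stddef.h>

void CopyWords(uint32_t *dest, const uint32_t *src, size_t count)
{
    // Compiler-specific vectorization hints (my best guess at the
    // equivalents of "#pragma loop" for these two compilers).
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < count; ++i)
        dest[i] = src[i];
}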
Do I need to use the GCC compiler option -fno-builtin-memcpy, even with -O3, to keep the compiler from inlining its built-in memcpy when adding my own memcpy? Or is simply overriding memcpy in my code enough?
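To illustrate what I mean by the builtin issue (this is my current understanding, which is exactly what I'd like confirmed):

#include <cstring>

void CopyHeader(void *dst, const void *src)
{
    // With builtins enabled, GCC may expand this call into inline code for
    // the known size instead of calling any memcpy symbol, so a user-supplied
    // memcpy would never be reached here.
    std::memcpy(dst, src, 16);
}
// -fno-builtin-memcpy should make such calls go through the memcpy symbol
// again; alternatively, giving the routine its own name (e.g. CMemUtils::MemCpy)
// sidesteps the builtin question entirely.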
Update:
After some tests it seems to me that an SSE2-optimized memcpy() is not enough faster to be worth the effort. I've asked a question about this on the Intel C/C++ Compiler forums.