If I were to write an OS-portable memcpy optimized for SSE2/SSE3, what would it look like? I want to support both the GCC and ICC compilers. The reason I ask is that memcpy is written in assembler in glibc and is not optimized for SSE2/SSE3, and other generic memcpy implementations may not fully exploit the system's capabilities with respect to data alignment, size, and so on.
Here is my current memcpy, which takes data alignment into consideration and is optimized for SSE2 (I think) but not for SSE3:
#ifdef __SSE2__
// SSE2 optimized memcpy()
void *CMemUtils::MemCpy(void *__restrict b, const void *__restrict a, size_t n)
{
    char *s1 = (char*)b;
    const char *s2 = (const char*)a;
    for (; 0 < n; --n) *s1++ = *s2++;
    return b;
}
#else
// Generic memcpy() implementation
void *CMemUtils::MemCpy(void *dest, const void *source, size_t count) const
{
#ifdef _USE_SYSTEM_MEMCPY
    // Use system memcpy()
    return memcpy(dest, source, count);
#else
    size_t blockIdx;
    size_t blocks = count >> 3;
    size_t bytesLeft = count - (blocks << 3);

    // Copy 64-bit blocks first
    const _UINT64 *sourcePtr8 = (const _UINT64*)source;
    _UINT64 *destPtr8 = (_UINT64*)dest;
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr8[blockIdx] = sourcePtr8[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 2;
    bytesLeft = bytesLeft - (blocks << 2);
    // Copy 32-bit blocks
    const _UINT32 *sourcePtr4 = (const _UINT32*)&sourcePtr8[blockIdx];
    _UINT32 *destPtr4 = (_UINT32*)&destPtr8[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr4[blockIdx] = sourcePtr4[blockIdx];
    if (!bytesLeft) return dest;

    blocks = bytesLeft >> 1;
    bytesLeft = bytesLeft - (blocks << 1);
    // Copy 16-bit blocks
    const _UINT16 *sourcePtr2 = (const _UINT16*)&sourcePtr4[blockIdx];
    _UINT16 *destPtr2 = (_UINT16*)&destPtr4[blockIdx];
    for (blockIdx = 0; blockIdx < blocks; blockIdx++) destPtr2[blockIdx] = sourcePtr2[blockIdx];
    if (!bytesLeft) return dest;

    // Copy remaining bytes
    const _UINT8 *sourcePtr1 = (const _UINT8*)&sourcePtr2[blockIdx];
    _UINT8 *destPtr1 = (_UINT8*)&destPtr2[blockIdx];
    for (blockIdx = 0; blockIdx < bytesLeft; blockIdx++) destPtr1[blockIdx] = sourcePtr1[blockIdx];
    return dest;
#endif
}
#endif
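For reference, this is roughly what I picture an explicitly SSE2 copy looking like with intrinsics, instead of relying on the compiler's auto-vectorization (a minimal, untested sketch; the MemCpySSE2 name and the 16-byte loop are my own illustration, not production code):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stddef.h>

// Illustrative only: copy 16 bytes per iteration with unaligned SSE2
// loads/stores, then finish the tail byte by byte.
void *MemCpySSE2(void *dest, const void *source, size_t count)
{
    char *d = (char*)dest;
    const char *s = (const char*)source;
    for (; count >= 16; count -= 16, d += 16, s += 16)
        _mm_storeu_si128((__m128i*)d, _mm_loadu_si128((const __m128i*)s));
    while (count--) *d++ = *s++;
    return dest;
}

For very large buffers I assume non-temporal stores (_mm_stream_si128) could avoid polluting the cache, but they require a 16-byte-aligned destination, which ties into the alignment question further down.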
Not all memcpy implementations are thread-safe, which is just another reason to make our own version. All this leads me to conclude that I should at least try to make a thread-safe, OS-portable memcpy that is optimized for SSE2/SSE3 where available.
I've also read that GCC supports aggressive unrolling with the -funroll-loops compiler option. Could this improve performance with SSE2 and/or SSE3 if there are no significant cache misses?
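As an aside, if I understand the GCC documentation correctly, unrolling can also be requested per function via the optimize attribute rather than globally with -funroll-loops (GCC-specific, and I have not measured whether it actually helps the SSE2 path):

#include <stddef.h>

// Hypothetical per-function unrolling hint (GCC attribute syntax).
__attribute__((optimize("unroll-loops")))
void CopyBytesUnrolled(char *dest, const char *src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dest[i] = src[i];
}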
Is there a performance gain in making different memcpy versions for 32- and 64-bit architectures?
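If it matters, the way I could imagine doing this (a hypothetical, unbenchmarked sketch) is to pick the copy word to match the pointer width, so each build moves its native word size:

#include <stdint.h>
#include <stddef.h>

// Hypothetical: choose the block type per architecture at compile time,
// so a 32-bit build moves 4 bytes and a 64-bit build 8 bytes per iteration.
#if UINTPTR_MAX > 0xFFFFFFFFu
typedef uint64_t copy_word_t;
#else
typedef uint32_t copy_word_t;
#endif

static void CopyBulk(copy_word_t *dest, const copy_word_t *src, size_t words)
{
    for (size_t i = 0; i < words; ++i) dest[i] = src[i];
}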
Is there any performance gain in pre-aligning internal memory buffers before copying?
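What I have in mind is roughly the following (a rough, untested sketch of an alignment-aware SSE2 copy; the head/bulk/tail split and the MemCpyAligned name are my own illustration):

#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <stddef.h>

void *MemCpyAligned(void *dest, const void *source, size_t n)
{
    uint8_t *d = (uint8_t*)dest;
    const uint8_t *s = (const uint8_t*)source;

    // Head: copy single bytes until the destination is 16-byte aligned.
    while (n && ((uintptr_t)d & 15)) { *d++ = *s++; --n; }

    // Bulk: aligned stores; loads stay unaligned because the source
    // alignment is still unknown after shifting the destination.
    for (; n >= 16; n -= 16, d += 16, s += 16)
        _mm_store_si128((__m128i*)d, _mm_loadu_si128((const __m128i*)s));

    // Tail: whatever is left.
    while (n--) *d++ = *s++;
    return dest;
}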
How do I use #pragma loop to control how loop code is treated by the SSE2/SSE3 auto-parallelizer? Supposedly one could use #pragma loop where contiguous data regions are moved by a for() loop.
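For what it's worth, my impression (please correct me if this is wrong) is that neither GCC nor ICC accepts #pragma loop as such; the nearest equivalents appear to be #pragma GCC ivdep on newer GCC and #pragma ivdep / #pragma vector always on ICC, which tell the vectorizer to ignore assumed dependencies in the loop that follows:

#include <stdint.h>
#include <stddef.h>

void CopyWords(uint32_t *dest, const uint32_t *src, size_t count)
{
    // Compiler-specific vectorization hints (my best guess at the
    // equivalents of "#pragma loop" for these two compilers).
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < count; ++i)
        dest[i] = src[i];
}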
Do I need to use the GCC compiler option -fno-builtin-memcpy, even with -O3, to keep the compiler from inlining its built-in memcpy when adding my own memcpy? Or is simply overriding memcpy in my code enough?
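To illustrate what I mean by the builtin issue (this is my current understanding, which is exactly what I'd like confirmed):

#include <cstring>

void CopyHeader(void *dst, const void *src)
{
    // With builtins enabled, GCC may expand this call into inline code for
    // the known size instead of calling any memcpy symbol, so a user-supplied
    // memcpy would never be reached here.
    std::memcpy(dst, src, 16);
}
// -fno-builtin-memcpy should make such calls go through the memcpy symbol
// again; alternatively, giving the routine its own name (e.g. CMemUtils::MemCpy)
// sidesteps the builtin question entirely.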
Update:
After some tests it seems to me that an SSE2-optimized memcpy() is not enough faster to be worth the effort. I've asked a question about this on the Intel C/C++ Compiler forums.