I've always had the same question in mind about instruction-level parallelism and loops: how does a CPU parallelize loops? Does it execute multiple successive iterations at once, does it execute subsequent instructions that are independent of the loop, or both?
Consider the following function for reversing an array as an example:
```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// pretend that uint64_t is strict-aliasing safe,
// like it was defined with GNU C __attribute__((may_alias, aligned(4)))
void reverse_SSE2(uint32_t* arr, size_t length)   // assumes length >= 4
{
    __m128i* startPtr = (__m128i*)arr;
    __m128i* endPtr = (__m128i*)(arr + (length - sizeof(__m128i) / sizeof(uint32_t)));

    while (startPtr < (__m128i*)((uint32_t*)endPtr - (sizeof(__m128i) / sizeof(uint32_t) - 1)))
    {
        __m128i lo = _mm_loadu_si128(startPtr);
        __m128i hi = _mm_loadu_si128(endPtr);
        __m128i reverseLo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(0, 1, 2, 3));
        __m128i reverseHi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(0, 1, 2, 3));
        _mm_storeu_si128(startPtr++, reverseHi);
        _mm_storeu_si128(endPtr--, reverseLo);
    }

    uint64_t* startPtr64 = (uint64_t*)startPtr;
    uint64_t* endPtr64 = (uint64_t*)endPtr + 1;

    if (startPtr64 < (uint64_t*)((uint32_t*)endPtr64 - (sizeof(uint64_t) / sizeof(uint32_t) - 1)))
    {
        // 4 to 7 elements left: swap one 64-bit pair, then at most one 32-bit pair
        uint64_t lo = *startPtr64;
        uint64_t hi = *endPtr64;
        lo = (lo << 32) | (lo >> 32);   // rotate by 32 swaps the two 32-bit halves
        hi = (hi << 32) | (hi >> 32);
        *startPtr64++ = hi;
        *endPtr64 = lo;

        uint32_t* startPtr32 = (uint32_t*)startPtr64;
        uint32_t* endPtr32 = (uint32_t*)endPtr64 - 1;
        if (endPtr32 < startPtr32)      // clamp (a max()); a self-swap is a no-op
            endPtr32 = startPtr32;
        uint32_t lo32 = *startPtr32;
        uint32_t hi32 = *endPtr32;
        *startPtr32 = hi32;
        *endPtr32 = lo32;
    }
    else
    {
        // 0 to 3 elements left: plain scalar reverse
        uint32_t* startPtr32 = (uint32_t*)startPtr64;
        uint32_t* endPtr32 = (uint32_t*)endPtr64 + 1;
        while (endPtr32 > startPtr32)
        {
            uint32_t lo = *startPtr32;
            uint32_t hi = *endPtr32;
            *startPtr32++ = hi;
            *endPtr32-- = lo;
        }
    }
}
```
This code could be rewritten for maximum performance in a few ways, depending on the answers to my questions. (Yes, in this specific example it would be irrelevant, but this is just an example; we could continue with vectors that (may) do overlapping stores, as long as they load data that hasn't already been reversed.)
- The while loop could be unrolled, since the throughput of the instructions used is higher than what one iteration issues. If the hardware "unrolls" the loop by overlapping successive iterations on its own, that would not be necessary.
- All of the code following the while loop could be rewritten to not depend on the final startPtr and endPtr values, and thus on the while loop itself, by computing the number of remaining elements up front and deriving the tail pointers from it. If the CPU cannot execute other instructions while looping, the current form introduces additional overhead. If it can, the tail code would finish at about the same time as the while loop.
- If the code following the while loop does not execute in parallel with it, that code might better be moved to the top of the function, so that one loop iteration can start executing in parallel with that code, since the initial check is cheap to calculate. The possibility of introducing another cache miss does not matter in this case.
As an additional question (since the code following the loop has a branch and another loop): how are flags registers handled in superscalar CPUs? Are there multiple physical ones?