
I've always had the same question in mind about instruction-level parallelism and loops:
How does a CPU parallelize loops? Does it execute multiple successive iterations at once, does it execute subsequent instructions that are independent of the loop, or both?

Consider the following function for reversing an array as an example:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// pretend that uint64_t is strict-aliasing safe,
// like it was defined with GNU C __attribute__((may_alias, aligned(4)))

void reverse_SSE2(uint32_t* arr, size_t length)
{
    __m128i* startPtr = (__m128i*)arr;
    __m128i* endPtr = (__m128i*)(arr + (length - sizeof(__m128i) / sizeof(uint32_t)));
    while (startPtr < (__m128i*)((uint32_t*)endPtr - (sizeof(__m128i) / sizeof(uint32_t) - 1)))
    {
        __m128i lo = _mm_loadu_si128(startPtr);
        __m128i hi = _mm_loadu_si128(endPtr);
        __m128i reverseLo = _mm_shuffle_epi32(lo, _MM_SHUFFLE(0, 1, 2, 3));
        __m128i reverseHi = _mm_shuffle_epi32(hi, _MM_SHUFFLE(0, 1, 2, 3));
        _mm_storeu_si128(startPtr++, reverseHi);
        _mm_storeu_si128(endPtr--, reverseLo);
    }

    uint64_t* startPtr64 = (uint64_t*)startPtr;
    uint64_t* endPtr64 = (uint64_t*)endPtr + 1;
    if (startPtr64 < (uint64_t*)((uint32_t*)endPtr64 - (sizeof(uint64_t) / sizeof(uint32_t) - 1)))
    {
        uint64_t lo = *startPtr64;
        uint64_t hi = *endPtr64;
    lo = (lo << 32) | (lo >> 32);    // rotate by 32 bits (was math.rol)
    hi = (hi << 32) | (hi >> 32);    // rotate by 32 bits (was math.ror)
        *startPtr64++ = hi;
        *endPtr64 = lo;

        uint32_t* startPtr32 = (uint32_t*)startPtr64;
    uint32_t* endPtr32 = (uint32_t*)endPtr64 - 1;
    if (endPtr32 < startPtr32) endPtr32 = startPtr32;    // clamp (was math.max)
        uint32_t lo32 = *startPtr32;
        uint32_t hi32 = *endPtr32;
        *startPtr32 = hi32;
        *endPtr32 = lo32;
    }
    else
    {
        uint32_t* startPtr32 = (uint32_t*)startPtr64;
        uint32_t* endPtr32 = (uint32_t*)endPtr64 + 1;
        while (endPtr32 > startPtr32)
        {
            uint32_t lo = *startPtr32;
            uint32_t hi = *endPtr32;
            *startPtr32++ = hi;
            *endPtr32-- = lo;
        }
    }
}

This code could be rewritten for maximum performance in a few ways, depending on the answers to my questions. (Yes, in this specific example it would be irrelevant, but it's just an example; we could continue with vectors that (may) do overlapping stores, as long as they load data that hasn't already been reversed.)

  • The while loop could be unrolled, since the throughput of the instructions used is higher than what one iteration issues. If the hardware "unrolls" the loop, that would not be necessary.
  • All of the code following the while loop could be rewritten to not depend on the startPtr and endPtr values (and thus on the while loop), by calculating the number of remaining elements up front and deriving the pointers from it. If the CPU cannot execute other instructions while looping, the current version introduces additional overhead; if it can, the tail code could finish at about the same time as the while loop.
  • If the code following the while loop does not execute in parallel with it, that code might be better moved to the top of the function, so that one loop iteration can start executing in parallel with that code, since the initial check is cheap to calculate. The potential of introducing another cache miss does not matter in this case.

As an additional question (since the code following the loop has a branch and another loop): How are flags registers handled in superscalar CPUs? Are there multiple physical ones?

  • Loops aren't special, they're just branches. Trace the dependency chains across iterations. See [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331) / [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) / [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) (and other links in https://stackoverflow.com/tags/x86/info, especially https://agner.org/optimize/) – Peter Cordes Jan 18 '22 at 05:30
  • Fun fact: there is actually a kind of "unrolling" that happens in the loop buffer (LSD) in Sandybridge-family since Haswell: [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872). But no, they don't create any of the benefit that compile-time unrolling would, like using pointer offsets in addressing modes for a few loads/stores, only incrementing pointers once per iteration. – Peter Cordes Jan 18 '22 at 05:32
  • 2
    Flag registers are renamed just as architectural registers are. – Olsonist Jan 18 '22 at 05:51
  • @PeterCordes lots to get into so far - thank you! Looking for a very narrow set of specific answers, though. I understand that the CPU predicts to go through another iteration of the loop at some point and will then basically fill its pipeline with as many iterations as possible, but does it also execute the not (YET! - it will, unconditionally) taken instruction stream after the loop? – MrUnbelievable92 Jan 18 '22 at 05:51
  • If the loop-exit is correctly predicted for the last iteration (when it falls through), yes, the window of instructions it can see for OoO exec will include code after the loop. But it's quite common for it to mispredict the final iteration because it was taken N times during the loop. Of course, fast-recovery allows that mispredict to be handled while some loop iterations are still in flight, especially if the loop was unrolled so the control dep-chain is much shorter than the loop body. (And for some reason it can run ahead, e.g. loop body bottlenecks on load/store port throughput.) – Peter Cordes Jan 18 '22 at 06:00
  • Related: [What exactly happens when a skylake CPU mispredicts a branch?](https://stackoverflow.com/q/50984007) re: fast recovery to allow detecting a mispredict and starting to execute the correct path without draining the ROB first. – Peter Cordes Jan 18 '22 at 06:01
  • 2
    I looked up an email from an Intel compiler engineer. Flags are not just renamed. Individual bits are renamed. – Olsonist Jan 18 '22 at 06:19
  • 3
    @Olsonist: not quite: it's split into two chunks in current Intel CPUs: the the carry flag (CF) and everything else (SPAZO). This allows register renaming without false dependencies for `inc`/`dec` which leave CF unmodified, and for `bt` and a few others which write *only* CF. See [What is a Partial Flag Stall?](https://stackoverflow.com/q/49867597) - and note that Intel since Broadwell or Skylake doesn't do FLAGS merging uops anymore: instead, uops like `ja` simply read both CF and SPAZO as two separate inputs. (That's why most `cmov` instructions are 1 uop, but `cmova` / `be` are 2 uops) – Peter Cordes Jan 18 '22 at 06:35

0 Answers