8

Today I found some sample code that slowed down by 50% after I added some unrelated code. After debugging, I figured out the problem was loop alignment. Depending on where the loop code is placed, the execution time differs, e.g.:

Address           Time [us]
00007FF780A01270   980
00007FF7750B1280  1500
00007FF7750B1290   986
00007FF7750B12A0  1500

I didn't expect code alignment to have such a big impact, and I thought my compiler was smart enough to align the code correctly.

What exactly causes such a big difference in execution time? (I suppose it is some processor architecture detail.)

I compiled the test program in Release mode with Visual Studio 2019 and ran it on Windows 10. I checked the program on two processors: an i7-8700K (the results above) and an i5-3570K, where the problem does not exist and the execution time is always about 1250us. I also tried compiling the program with clang, but with clang the result is always ~1500us (on the i7-8700K).

My test program:

#include <chrono>
#include <cstring>   // memset
#include <iostream>
#include <intrin.h>
using namespace std;

template<int N>
__forceinline void noops()
{
    __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop(); __nop();
    noops<N - 1>();
}
template<>
__forceinline void noops<0>(){}

template<int OFFSET>
__declspec(noinline) void SumHorizontalLine(const unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    unsigned short sum = 0;
    const unsigned char* srcP1 = src - a - 1;
    const unsigned char* srcP2 = src + a;

    //some dummy loop,just a few iterations
    for (int i = 0; i < a; ++i)
        dst[i] = src[i] / (double)dst[i];

    noops<OFFSET>();
    //the important loop
    for (int x = a + 1; x < width - a; x++)
    {
        unsigned char v1 = srcP1[x];
        unsigned char v2 = srcP2[x];
        sum -= v1;
        sum += v2;
        dst[x] = sum;
    }

}

template<int OFFSET>
void RunTest(unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    double minTime = 99999999;
    for(int i = 0; i < 20; ++i)
    {
        auto start = chrono::steady_clock::now();

        for (int i = 0; i < 1024; ++i)
        {
            SumHorizontalLine<OFFSET>(src, width, a, dst);
        }

        auto end = chrono::steady_clock::now();
        auto us = chrono::duration_cast<chrono::microseconds>(end - start).count();
        if (us < minTime)
        {
            minTime = us;
        }
    }

    cout << OFFSET << " : " << minTime << " us" << endl;
}

int main()
{
    const int width = 2048;
    const int x = 3;
    unsigned char* src = new unsigned char[width * 5];
    unsigned short* dst = new unsigned short[width];
    memset(src, 0, sizeof(unsigned char) * width);
    memset(dst, 0, sizeof(unsigned short) * width);

    while(true)
    RunTest<1>(src, width, x, dst);
}

To verify different alignments, just recompile the program and change RunTest<0> to RunTest<1>, etc. The compiler always aligns the code to 16 bytes; in my test code I just insert additional nops to shift the loop a bit further.
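
If recompiling for every offset gets tedious, a minimal sketch (my addition, not part of the original test) could instantiate several offsets in one binary, so each hot loop is shifted by a different multiple of 16 bytes; the fold expression needs /std:c++17 on MSVC:

// Hypothetical helper: instantiate RunTest for several OFFSET values in one build.
template<int... Offsets>
void RunAllTests(unsigned char* __restrict src, int width, int a, unsigned short* __restrict dst)
{
    (RunTest<Offsets>(src, width, a, dst), ...);   // C++17 fold over the comma operator
}

// usage in main(), instead of the single RunTest<1> call:
//   RunAllTests<0, 1, 2, 3>(src, width, x, dst);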

Assembly code generated for the loop with OFFSET=1 (for other offsets only the number of npads differs):

  0007c 90       npad    1
  0007d 90       npad    1
  0007e 49 83 c1 08  add     r9, 8
  00082 90       npad    1
  00083 90       npad    1
  00084 90       npad    1
  00085 90       npad    1
  00086 90       npad    1
  00087 90       npad    1
  00088 90       npad    1
  00089 90       npad    1
  0008a 90       npad    1
  0008b 90       npad    1
  0008c 90       npad    1
  0008d 90       npad    1
  0008e 90       npad    1
  0008f 90       npad    1
$LL15@SumHorizon:

; 25   : 
; 26   :    noops<OFFSET>();
; 27   : 
; 28   :    for (int x = a + 1; x < width - a; x++)
; 29   :    {
; 30   :        unsigned char v1 = srcP1[x];
; 31   :        unsigned char v2 = srcP2[x];
; 32   :        sum -= v1;

  00090 0f b6 42 f9  movzx   eax, BYTE PTR [rdx-7]
  00094 4d 8d 49 02  lea     r9, QWORD PTR [r9+2]

; 33   :        sum += v2;

  00098 0f b6 0a     movzx   ecx, BYTE PTR [rdx]
  0009b 48 8d 52 01  lea     rdx, QWORD PTR [rdx+1]
  0009f 66 2b c8     sub     cx, ax
  000a2 66 44 03 c1  add     r8w, cx

; 34   :        dst[x] = sum;

  000a6 66 45 89 41 fe   mov     WORD PTR [r9-2], r8w
  000ab 49 83 ea 01  sub     r10, 1
  000af 75 df        jne     SHORT $LL15@SumHorizon

; 35   :    }
; 36   : 
; 37   : }

  000b1 c3       ret     0
??$SumHorizontalLine@$00@@YAXPEIBEHHPEIAG@Z ENDP    ; SumHorizont
Peter Cordes
AdamF
  • compiler options? optimization level? – 463035818_is_not_an_ai May 07 '21 at 22:14
  • @largest_prime_is_463035818 Default Release, x64, /O2. – AdamF May 07 '21 at 22:16
  • Resolution of the timer tick? `chrono` offers nanoseconds, but unless you have really groovy custom hardware you won't get below a microsecond. On conventional desktop hardware you might not even get reliable milliseconds. – user4581301 May 07 '21 at 22:18
  • Hardware destructive interference size is real. Use it. That said, you've used _one_ compiler to test? `g++`, `clang++` and `MSVC` usually show very different performance (even with the same target arch). – Ted Lyngmo May 07 '21 at 22:21
  • @user4581301 - just increase the width variable to get execution time in seconds - the same effect. – AdamF May 07 '21 at 22:24
  • @TedLyngmo Using different compiler generates completely different code, and I would like to understand what happens in this assembly generated code. And yes, I have checked also `clang++`, there is no such effect, but clang is slower than `MSVC` in this case. – AdamF May 07 '21 at 22:26
  • @AdamF Works for me. I just see a lot of questions with folks worrying about odd jumps in "performance" simply because they're measuring too close to the tick width – user4581301 May 07 '21 at 22:32
  • @AdamF Perhaps setting up a https://quick-bench.com/ comparison would be appropriate? It doesn't do a multitude of compilers though, but `g++` and `clang++` is there. `clang++` brutes some stuff I find complex into "no, I'm not looping - I got this", `g++` is steady all over even though `clang++` wins when it finds the possible optimizations (and it does that often). I'd love to see `MSVC` in an _ExecutionPolicy_ test against `g++` and `clang++`. In my small personal tests, MSVC's version has been very good (as in owned). – Ted Lyngmo May 07 '21 at 22:33
  • alignment can certainly affect performance; this is not a surprise. there is no reason for a compiler to attempt to avoid this. aligning the beginning of data or functions or whatever can ensure it rather than avoid it. x86 has so much overhead though that most of these alignment issues should be buried in the noise, so it's interesting that you teased one out. start using assembly for this test where you can easily control the address/offset; likewise make sure this is not a timing issue, which is very common (not actually measuring what you think you are measuring) – old_timer May 08 '21 at 00:35

2 Answers

11

In the slow cases (i.e., 00007FF7750B1280 and 00007FF7750B12A0), the jne instruction crosses a 32-byte boundary. The microcode mitigations for the "Jump Conditional Code" (JCC) erratum (https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf) prevent such instructions from being cached in the DSB (the decoded µop cache). The JCC erratum only applies to Skylake-based CPUs, which is why the effect does not occur on your i5-3570K (an Ivy Bridge CPU).
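
To make that concrete (my reading of the listing, assuming the addresses in the question are those of the loop label $LL15@SumHorizon at listing offset 00090): the jne at listing offset 000af starts 0x1f bytes after the loop label and is 2 bytes long, so

  loop top ...1270 ( 980us): jne occupies ...128F–...1290  -> no 32-byte boundary touched
  loop top ...1280 (1500us): jne occupies ...129F–...12A0  -> crosses the boundary at ...12A0
  loop top ...1290 ( 986us): jne occupies ...12AF–...12B0  -> no 32-byte boundary touched
  loop top ...12A0 (1500us): jne occupies ...12BF–...12C0  -> crosses the boundary at ...12C0

The two slow measurements are exactly the placements where the jne straddles a 32-byte boundary.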

As Peter Cordes pointed out in a comment, recent compilers have options that try to mitigate this effect. Intel JCC Erratum - should JCC really be treated separately? mentions MSVC's /QIntel-jcc-erratum option; another related question is How can I mitigate the impact of the Intel jcc erratum on gcc?
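
For reference, a minimal way to try the MSVC workaround on this test program (the file name is just an example; the option pads code so that jump instructions do not touch 32-byte boundaries) would be something like:

  cl /O2 /QIntel-jcc-erratum main.cpp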

Andreas Abel
  • IIRC, modern GCC/clang and/or possibly even `as` itself have options to try to mitigate this. But it's a recent effect so only the latest compiler versions know about it. Related: [Intel JCC Erratum - should JCC really be treated separately?](https://stackoverflow.com/q/62305998) mentions MSVC's `/QIntel-jcc-erratum` option. (And points out that even if the erratum only involved JCC, the mitigation definitely causes a problem for JMP/CALL/RET as well.) – Peter Cordes May 08 '21 at 09:23
  • @PeterCordes That comment seems far too important to be left as a comment. If Andreas agrees, put it in the answer? – Ted Lyngmo May 08 '21 at 22:45
  • Thanks, that was exactly the problem. I have also verified the `/QIntel-jcc-erratum` flag, and it fixes the issue. @Andreas Abel's answer is fine for me, and reading detailed explanations from @Peter Cordes is always a pleasure. – AdamF May 09 '21 at 10:56
  • The interesting thing is the code generated by clang, which is always slow (the same speed as the incorrectly aligned MSVC version), even though there the cmp/jnz does not cross a 32-byte boundary. So it is probably a completely different case anyway ( https://godbolt.org/z/bGqde9be1 ) – AdamF May 09 '21 at 11:00
  • @AdamF: Looks like clang creates a loop-carried dep chain 3 cycles long (add/sub, and a `movzx edi,di` which is pointless: the high bytes of EDI don't matter.) i.e. clang compiles it naively, as written, instead of `sum += (v2-v1)` with the subtraction not part of the loop-carried dep chain. MSVC does do that optimization. Related: [Out-of-order execution in C#](https://stackoverflow.com/q/67321596) re: minimizing latency with associative integer math. Compilers are surprisingly bad at a non-looping function, but you'd hope clang would do better in a loop. – Peter Cordes May 09 '21 at 11:09
  • @AdamF: You could hand-hold clang by using `unsigned int sum`, so it didn't think it needed to redo zero-extension inside the loop, and could instead just store the low 16 bits of it. That's how I would have written it in the first place; don't tempt the compiler into using an inconvenient size. (Also, note that C++ operators like `+` promote narrow args to `int`, so `sum -= v1` was actually promoting both sides to int, then converting to `unsigned short`, in the abstract machine.) OTOH, using narrower types can sometimes help auto-vectorization not widen too much, but no auto-vec here :/ [see the sketch after this comment thread] – Peter Cordes May 09 '21 at 11:12
  • Also related, a [Phoronix article](https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode&num=1) about the JCC erratum mitigation's performance cost, with before/after benchmarks. (I think before compilers had introduced options for workarounds.) Also that article shows the key diagram of cmp/jcc or jmp alignments that trigger the problem. – Peter Cordes Jul 18 '21 at 06:19
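
As a concrete illustration of the suggestion in Peter Cordes' comment above, here is a hypothetical rewrite of the "important loop" inside SumHorizontalLine with a 32-bit accumulator (my sketch, not code from the question; only the low 16 bits are stored, so the results are unchanged):

// Widened accumulator: the loop-carried chain is just one add per iteration,
// and no re-zero-extension of sum is needed inside the loop.
unsigned int sum = 0;
for (int x = a + 1; x < width - a; x++)
{
    unsigned char v1 = srcP1[x];
    unsigned char v2 = srcP2[x];
    sum += (unsigned int)v2 - v1;        // v2 - v1 is computed off the carried chain
    dst[x] = (unsigned short)sum;        // store the low 16 bits, same values as before
}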
0

I thought my compiler was smart enough to align the code correctly.

As you said, the compiler always aligns things to a multiple of 16 bytes, which probably takes care of the direct effects of alignment. But there are limits to the "smartness" of the compiler.

Besides alignment, code placement has indirect performance effects as well, because of cache associativity. If there is too much contention for the few cache lines that can map to this address, performance will suffer. Moving to an address with less contention makes the problem go away.

The compiler may be smart enough to handle cache contention effects as well, but only if you turn on profile-guided optimization. The interactions are far too complex to predict in a reasonable amount of work; it is much easier to detect cache conflicts by actually running the program, and that is what PGO does.
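
For completeness, a rough sketch of the MSVC PGO workflow (switch names as in current MSVC documentation; the representative workload you run is up to you):

  cl /O2 /GL main.cpp /link /LTCG /GENPROFILE
  rem run a representative workload; the instrumented build records .pgc count files
  main.exe
  cl /O2 /GL main.cpp /link /LTCG /USEPROFILE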

Ben Voigt