
Please do not say this is premature micro-optimization. I want to understand, as far as my limited knowledge allows, how the described Sandy Bridge feature and the generated assembly work, and to make sure that my code makes use of this architectural feature. Thank you for understanding.

I started learning intrinsics a few days ago, so the answer may seem obvious to some, but I don't have a reliable source of information to figure this out.

I need to optimize some code for a Sandy Bridge CPU (this is a requirement). I know that it can do one AVX multiply and one AVX add per cycle, and I have read this paper:

http://research.colfaxinternational.com/file.axd?file=2012%2F7%2FColfax_CPI.pdf

which shows how this can be done in C++. The problem is that my code won't get auto-vectorized by Intel's compiler (using it is another requirement for the task), so I decided to implement the kernel manually with intrinsics, like this:

__m256d __sum1 = _mm256_setzero_pd();
__m256d __sum2 = _mm256_setzero_pd();
__m256d __sum3 = _mm256_setzero_pd();
__m256d __aa1, __bb1, __aa2, __bb2, __aa3, __bb3;
sum = 0;
for (kk = k; kk < k + BS && kk < aW; kk += 12)
{
    const double *a_addr = &A[i * aW + kk];
    const double *b_addr = &newB[jj * aW + kk];
    __aa1 = _mm256_load_pd(a_addr);
    __bb1 = _mm256_load_pd(b_addr);
    __sum1 = _mm256_add_pd(__sum1, _mm256_mul_pd(__aa1, __bb1));

    __aa2 = _mm256_load_pd(a_addr + 4);
    __bb2 = _mm256_load_pd(b_addr + 4);
    __sum2 = _mm256_add_pd(__sum2, _mm256_mul_pd(__aa2, __bb2));

    __aa3 = _mm256_load_pd(a_addr + 8);
    __bb3 = _mm256_load_pd(b_addr + 8);
    __sum3 = _mm256_add_pd(__sum3, _mm256_mul_pd(__aa3, __bb3));
}
__sum1 = _mm256_add_pd(__sum1, _mm256_add_pd(__sum2, __sum3));
_mm256_store_pd(&vsum[0], __sum1);

The reason I manually unroll the loop like this is explained here:

Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell

They say you need to unroll by a factor of 3 to achieve the best performance on Sandy Bridge. My naive testing confirms that this version indeed runs faster than one without unrolling or with 4-fold unrolling.

OK, so here is the problem. The icl compiler from Intel Parallel Studio 15 generates this:

    $LN149:
            movsxd    r14, r14d                                     ;78.49
    $LN150:
            vmovupd   ymm3, YMMWORD PTR [r11+r14*8]                 ;80.48
    $LN151:
            vmovupd   ymm5, YMMWORD PTR [32+r11+r14*8]              ;84.49
    $LN152:
            vmulpd    ymm4, ymm3, YMMWORD PTR [r8+r14*8]            ;82.56
    $LN153:
            vmovupd   ymm3, YMMWORD PTR [64+r11+r14*8]              ;88.49
    $LN154:
            vmulpd    ymm15, ymm5, YMMWORD PTR [32+r8+r14*8]        ;86.56
    $LN155:
            vaddpd    ymm2, ymm2, ymm4                              ;82.34
    $LN156:
            vmulpd    ymm4, ymm3, YMMWORD PTR [64+r8+r14*8]         ;90.56
    $LN157:
            vaddpd    ymm0, ymm0, ymm15                             ;86.34
    $LN158:
            vaddpd    ymm1, ymm1, ymm4                              ;90.34
    $LN159:
            add       r14d, 12                                      ;76.57
    $LN160:
            cmp       r14d, ebx                                     ;76.42
    $LN161:
            jb        .B1.19        ; Prob 82%                      ;76.42

To me this looks like a mess, where the order I expected (each add right next to its multiply, which I thought was required to use the handy Sandy Bridge feature) is broken up.

Question:

  • Will this assembly code leverage the Sandy Bridge feature I am referring to?

  • If not, what do I need to do in order to utilize the feature and prevent the code from becoming "tangled" like this?

Also, when there is only one loop iteration, the order is nice and clean, i.e. load, multiply, add, as it should be.

iksemyonov
  • I can't tell from your question whether you are aware that the processor itself is capable of reordering instructions. So the adds don't *need* to be next to the multiplies. Furthermore, the bottleneck in your code will be the loads. So you won't get much from overlapping adds and multiplies anyway. – Mysticial Jan 04 '15 at 20:18
  • Yes, I am aware that a CPU can reorder instructions, but I don't know when and how exactly it will do so. I know that memory is the most important part of the algorithm, sure, but when memory is more or less fine, I'd like to be sure that the FPU is working at full steam, correct? – iksemyonov Jan 04 '15 at 20:25
  • The FPU *can't* operate at full capacity in your example. Sandy Bridge can only sustain one AVX load each cycle. So the loop takes a minimum of 6 cycles. To saturate the FPUs, you need 6 adds *and* 6 multiplies. But you only have 3 of each - so you will never get more than 50% FPU throughput. – Mysticial Jan 04 '15 at 20:26
  • Hmmm. Now I'm a bit lost. Can you please check out the SO question link I posted? They say 3 times gives you the best performance. Also, the colfax article claims you can do one add and one multiply simultaneously on SB. Add: Wait, gotcha, it's about loads vs processing the data. Thank you for the advice! – iksemyonov Jan 04 '15 at 20:28
  • This has nothing to do with the unroll factor. You simply have too many loads. Sandy Bridge can sustain 1 load, 1 add, and 1 multiply each cycle. But you need 2 loads, 1 add, and 1 multiply. So your bottleneck is the loads. – Mysticial Jan 04 '15 at 20:30
  • See above - I misunderstood your comment. Also, why does the Intel SB diagram (can't refer to it right now - it was in a university keynote) show 2 ports labeled "load"? Do I misunderstand that as well? – iksemyonov Jan 04 '15 at 20:31
  • Erm, must be a mistake. Automatics :) – iksemyonov Jan 04 '15 at 20:34
  • Ports 2 and 3 on SB each do 128-bit loads. You have to use both ports to get one 256-bit load. Haswell can do two 256-bit loads per clock cycle. – Z boson Jan 05 '15 at 08:34
  • If you look at the code in my link you referenced, you will see that one of the factors is constant in the loop (`__m256 a8 = _mm256_set1_ps(1.0f);`). If you define `__aa1 = _mm256_load_pd((a_addr));` outside of your loop (or broadcast a value, which is probably what you really want to do), then you will only have one 256-bit load per mult-add instead of two. Of course this will change what you're doing, so you need to think about what you want to do and see if this is possible. – Z boson Jan 05 '15 at 08:39
  • It's strange that ICC does not use the aligned loads which you used with the intrinsics. This prevents fused load-multiply instructions. I would try this with GCC and compare the assembly and the performance. – Z boson Jan 05 '15 at 08:46
  • Z boson, could you outline your ideas in an answer (as opposed to an inline comment) so they don't get lost? Where can I find a reference to the port widths? All I have is an Intel keynote without the sizes specified, and I haven't found them by googling so far. I'd also be grateful if you elaborated on fused load-multiply: what is actually fused together, loads + mults, or adds + mults? – iksemyonov Jan 05 '15 at 11:10
  • Yes, I realized later that one of the variables was only loaded once, outside of the loop. Maybe the alignment has to do with the array alignment? Currently the loads are made straight from a function argument - what if I "repack" the array into a smaller one, but explicitly aligned. – iksemyonov Jan 05 '15 at 11:13
  • I'll try and give an answer tomorrow. I'm swamped at work lately :-( – Z boson Jan 05 '15 at 17:54
  • Sure, same here, non-stop studies – iksemyonov Jan 05 '15 at 18:00
  • @Zboson On Sandy Bridge, an unaligned load/store is no slower than an aligned load/store when the address is aligned. So there's no reason to use aligned load/stores anymore. Furthermore, memory operands no longer need to be aligned. Personally, I'd prefer that it use aligned load/stores since I'll know right away if my data is misaligned. Compilers today don't fuse multiply-adds if you're using intrinsics since it breaks IEEE and it assumes you already know what you're doing. – Mysticial Jan 06 '15 at 02:24
  • @Mysticial, I was not referring to fusing the multiply-add, I was referring to fusing the multiplication and load (micro-op fusion). I have seen different results with and without it https://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp. As far as I can tell the fusion does not happen with unaligned loads. So there may not be a penalty to use an unaligned store anymore in terms of that instruction but that does not mean it does not have other effects. – Z boson Jan 06 '15 at 08:34
  • @Mysticial, I looked at the OP's assembly more carefully. It turns out that it is using the fused mult-loads (`vmulpd ymm4, ymm3, YMMWORD PTR [r8+r14*8]`). So it uses aligned loads for the fused mult-loads and unaligned loads when it's only a load. – Z boson Jan 06 '15 at 10:11
  • @Zboson Wait, I don't get it. There's no such thing as an "aligned" multiply with memory operand. They're all "unaligned". (for AVX) – Mysticial Jan 06 '15 at 10:22
  • @Mysticial, do you mean that in (`vmulpd ymm4, ymm3, YMMWORD PTR [r8+r14*8]`) that `YMMWORD PTR [r8+r14*8]` does not need to be 32-byte aligned? – Z boson Jan 06 '15 at 10:29
  • @Mysticial from my link here https://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp it's clear MSVC fused the load and mult even though I did not use an aligned load, so I guess that says that it does not need to be aligned, as you say. GCC, however, never does a fused load-mult if you use unaligned load intrinsics. I don't know why that is. – Z boson Jan 06 '15 at 10:50
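
To put rough numbers on the point made in these comments (a back-of-the-envelope count based on the throughput figures quoted above, not a measurement): each iteration of the unrolled loop in the question issues

    6 loads       (6 x 256-bit)  -> at least 6 cycles, since Sandy Bridge sustains only one
                                    256-bit load per cycle (two 128-bit load ports)
    3 multiplies  (vmulpd)       -> 3 cycles of work for the multiply port
    3 adds        (vaddpd)       -> 3 cycles of work for the add port

so the loads cap the loop at roughly 6 cycles per iteration, and the FP units are busy at most about half the time, no matter how the adds and multiplies are ordered.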

1 Answer


With x86 CPUs many people expect to get the maximum FLOPS from the dot product

for(int i=0; i<n; i++) sum += a[i]*b[i];

but this turns out not to be the case.

What can give the maximum FLOPS is this

for(int i=0; i<n; i++) sum += k*a[i];

where k is a constant. Why is the CPU not optimized for the dot product? I can speculate. One of the things CPUs are optimized for is BLAS. BLAS is considered a building block of many other routines.

The Level-1 and Level-2 BLAS routines become memory bandwidth bound as n increases. It's only the Level-3 routines (e.g. Matrix Multiplication) which are capable of being compute bound. This is because the Level-3 computations go as n^3 and the reads as n^2. So the CPU is optimized for the Level-3 routines. The Level-3 routines don't need to optimize for a single dot product. They only need to read from one matrix per iteration (sum += k*a[i]).
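
For illustration, here is roughly what the sum += k*a[i] kernel looks like with AVX intrinsics (a minimal sketch with made-up names, assuming a is 32-byte aligned and n is a multiple of 4, so it is not a drop-in replacement for your code). The point is that it needs only one 256-bit load per multiply-add, because k is broadcast into a register once before the loop:

#include <immintrin.h>

double scaled_sum(const double *a, int n, double k)
{
    __m256d vk   = _mm256_set1_pd(k);         // broadcast k once, outside the loop
    __m256d vsum = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4)
    {
        __m256d va = _mm256_load_pd(a + i);   // the only load in the loop body
        vsum = _mm256_add_pd(vsum, _mm256_mul_pd(va, vk));
    }
    double tmp[4];
    _mm256_storeu_pd(tmp, vsum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3]; // horizontal reduction
}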

From this we can conclude that the number of bits needed to be read each cycle to get the maximum FLOPS for the Level-3 routines is

read_size = SIMD_WIDTH * num_MAC

where num_MAC is the number of multiply–accumulate operations that can be done each cycle.

                   SIMD_WIDTH (bits)   num_MAC   read_size (bits)   ports used
Nehalem                  128               1            128         128 bits on port 2
Sandy Bridge             256               1            256         128 bits on ports 2 and 3
Haswell                  256               2            512         256 bits on ports 2 and 3
Skylake                  512               2           1024         ?

For Nehalem through Haswell this agrees with what the hardware is capable of. I don't actually know that Skylake will be able to read 1024 bits per clock cycle, but if it can't, AVX512 won't be very interesting, so I'm confident in my guess. A nice plot for Nehalem, Sandy Bridge, and Haswell for each port can be found at http://www.anandtech.com/show/6355/intels-haswell-architecture/8

So far I have ignored latency and dependency chains. To really get the maximum FLOPS you need to unroll the loop at least three times on Sandy Bridge (I use four, because I find it inconvenient to work with multiples of three).
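
For example, an unrolled-by-four version of the sum += k*a[i] kernel (again only a sketch with made-up names, continuing from the sketch above and assuming n is a multiple of 16 and a is 32-byte aligned) keeps four independent accumulators, so each vaddpd does not have to wait out the roughly 3-cycle latency of the previous add in the same chain:

double scaled_sum_unrolled4(const double *a, int n, double k)
{
    __m256d vk = _mm256_set1_pd(k);            // constant factor, loaded once
    __m256d s0 = _mm256_setzero_pd(), s1 = _mm256_setzero_pd();
    __m256d s2 = _mm256_setzero_pd(), s3 = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 16)
    {
        // four independent accumulators break the add dependency chain
        s0 = _mm256_add_pd(s0, _mm256_mul_pd(_mm256_load_pd(a + i     ), vk));
        s1 = _mm256_add_pd(s1, _mm256_mul_pd(_mm256_load_pd(a + i +  4), vk));
        s2 = _mm256_add_pd(s2, _mm256_mul_pd(_mm256_load_pd(a + i +  8), vk));
        s3 = _mm256_add_pd(s3, _mm256_mul_pd(_mm256_load_pd(a + i + 12), vk));
    }
    __m256d vsum = _mm256_add_pd(_mm256_add_pd(s0, s1), _mm256_add_pd(s2, s3));
    double tmp[4];
    _mm256_storeu_pd(tmp, vsum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];  // horizontal reduction
}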

The best way to answer your question about performance is to find the theoretical best performance you expect for your operation and then compare how close your code gets to it. I call this the efficiency (a rough worked example follows the links below). Doing this you will find that, despite the reordering of the instructions you see in the assembly, the performance is still good. But there are many other subtle issues you may need to consider. Here are three issues I encountered:

l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096.

obtaining-peak-bandwidth-on-haswell-in-the-l1-cache-only-getting-62%

difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp.
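
Coming back to efficiency, here is a rough worked example of what I mean (the 3.0 GHz clock is just an assumed number; plug in your own CPU's frequency):

peak       = clock rate x FLOPs per cycle
           = 3.0 GHz x (4 multiplies + 4 adds) per cycle = 24 double-precision GFLOPS per core

efficiency = measured GFLOPS / peak GFLOPS

If your kernel measured, say, 20 GFLOPS on one core, that would be about 83% efficiency.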

I also suggest you consider using IACA to study the performance.

Z boson
  • I wouldn't go so far as to say that "AVX512 won't be interesting if it can't load 1024 bits per cycle". Matrix multiply isn't the only application out there. The stuff that I deal with has much higher compute/load ratios. But given that Intel does seem to be optimizing the processor for linear algebra, it would be pretty hard not to get dual-issue 512-bit loads. – Mysticial Jan 06 '15 at 18:03
  • @Mysticial, you're right. I should have said it won't be interesting for BLAS. I think DGEMM is a benchmark many people, especially in HPC (Top500), expect. So for bragging rights Intel wants dual-issue 512-bit loads. I don't know if emphasizing BLAS is a good thing to optimize for in general. – Z boson Jan 07 '15 at 07:57
  • Thank you for the elaborate answer, I don't yet have time to study all the links, but will do so soon! – iksemyonov Jan 10 '15 at 14:05