18

We've got a simple memory throughput benchmark. All it does is repeatedly memcpy a large block of memory.

Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.

Any ideas why this might be happening? The code below is compiled as a Release build in VS2015 and reports the average time to complete each memcpy:

64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E

32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.

We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?

We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.

#include <memory>
#include <cstdlib>
#include <Windows.h>
#include <iostream>

// Prevent the memcpy from being optimized out of the for loop
__declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
    memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}

int main()
{
    const int SIZE_OF_BLOCKS = 25000000;
    const int NUMBER_ITERATIONS = 100;
    void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
    void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
    LARGE_INTEGER Frequency;
    QueryPerformanceFrequency(&Frequency);
    while (true)
    {
        LONGLONG total = 0;
        LONGLONG max = 0;
        LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
        for (int i = 0; i < NUMBER_ITERATIONS; ++i)
        {
            QueryPerformanceCounter(&StartingTime);
            MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
            QueryPerformanceCounter(&EndingTime);
            ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
            ElapsedMicroseconds.QuadPart *= 1000000;
            ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
            total += ElapsedMicroseconds.QuadPart;
            max = max(ElapsedMicroseconds.QuadPart, max);
        }
        std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
        std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
    }
    getchar();
}
aggieNick02
  • Does MSVC's memcpy library function select a strategy based on CPUID or anything? e.g. AVX loop vs. `rep movsb`? Did you make sure that both buffers are at least 64B-aligned for all tests? Did you check perf counters to see if you're getting any TLB misses, or just L3 cache misses? (Skylake can do two TLB walks in parallel). Is your Broadwell-E a multi-socket system (NUMA)? – Peter Cordes Sep 01 '16 at 02:25
  • Did you check the BIOS on your Broadwell system to make sure it doesn't have prefetching disabled or anything? Were you able to compare to other Broadwell or Haswell desktop systems? (rule out something being weird on the specific Broadwell machine you're testing on). – Peter Cordes Sep 01 '16 at 02:27
  • 1
    2.2ms to copy 23.8MiB is about 10.6GiB/s each of read and write, for mixed read+write. Intel says [Skylake i5-6600](http://ark.intel.com/products/88188) (and other SKL models using DDR4-2133) have a theoretical max memory bandwidth of 34.1 GB/s (or 31.8 GiB/s). So even if every load and store misses in L3 and has to go to main memory, that's only about 2/3rds of the theoretical max. That may be normal for a single thread, though. – Peter Cordes Sep 01 '16 at 02:33
  • 2
    On MSVC with intrinsic functions enabled, a call to memcpy will be inlined for buffer lengths that are compile-time constants. Otherwise, for 64-bit, it will generate a call to the library function, which itself calls the `RtlCopyMemory` API function. This is what would be happening in your case, since you've prevented the memcpy call from ever being inlined. And no, it does no fancy dispatching, just some sanity checks and `rep movs`. – Cody Gray - on strike Sep 01 '16 at 11:52
  • Well, I have to modify that last comment a bit. Looking at the disassembly, it appears that the 64-bit version of the function uses SSE2 instructions, except when the memory is unaligned, then it falls back to `rep movsb` for the trailing/ending unaligned bytes. Still, it is the same code running on both processors (there is no dynamic dispatching), so the implementation is not a factor in the performance difference. – Cody Gray - on strike Sep 01 '16 at 12:07
  • 1
    Edited above to indicate metrics gathered compiled for 64-bit. I've actually tested about 3 Haswell/Broadwell-E and 3 Skylake machines, and every Skylake machine destroys Haswell/Broadwell-E in this metric. My Broadwell-E system is not NUMA. The CPU config in BIOS hasn't been tweaked (verified Hardware Prefetcher and Adjacent Cache Line Prefetch are both enabled). I'll take a look at the TLB/L3 cache misses on both system classes. – aggieNick02 Sep 01 '16 at 14:32
  • Thanks @Cody. /facepalm at rep movsb for less-than-15B of unaligned data. An unaligned vector load/store that overlapped some of the aligned bytes would be much better. (Copying the same bytes twice is fine for memcpy (not memmove), but I can imagine a case where another thread is waiting to see something in the last aligned byte of a buffer and then atomically incrementing it, only to have that clobbered by the unaligned store... If you're being paranoid about compatibility then maybe you wouldn't do this, but ERMSB makes `rep movs` weakly-ordered internally and 32bit still uses that.) – Peter Cordes Sep 01 '16 at 14:35
  • The disassembly when compiled on my machine is just a bunch of movups on xmm0/xmm1 in a loop with some preamble and postamble. I'm sure I could look into optimizations related to alignment, etc., but the fact that a straight-up memcpy is so much slower with the exact same assembly is really interesting. I also played a bit with adjusting where I prevent inlining - a guard around the body of the for loop results in the same assembly for the actual memcpy. – aggieNick02 Sep 01 '16 at 14:53
  • @PeterCordes - What is the right way to measure the TLB/L3 cache misses? It's something I've not done before. Do I need to instrument my code to do it, i.e., https://msdn.microsoft.com/en-us/library/windows/desktop/aa371903(v=vs.85).aspx . Or use Intel's PCM at https://software.intel.com/en-us/articles/intel-performance-counter-monitor ? Or something else? – aggieNick02 Sep 01 '16 at 15:57
  • Updated above with further interesting numbers. First, on Skylake, the 32-bit and 64-bit versions of the same code perform similarly - approximately 2.2 ms. On Broadwell-E, the 32-bit code is substantially faster, 3.5ms vs 4.5ms for 64-bit (updated my Broadwell-E number for 64-bit, I had been too kind/conservative in remembering it). The 32-bit code is basically a rep movs. – aggieNick02 Sep 01 '16 at 16:11
  • The only sane way is with the CPU's performance counters. IDK what the easiest way is on Windows; but I think Intel's VTune is available for free. – Peter Cordes Sep 01 '16 at 20:18
  • What actual Broadwell CPU model did you test? How many cores, and what frequency? (Curious if *many* cores on the ring bus is slower than only a few cores, or if most of the effect is just Xeon vs. client chips at all with different clock domains.) – Peter Cordes Apr 28 '21 at 11:36
  • 1
    @PeterCordes i7-6800K, which is 6 cores/12 threads, at stock 3.4 GHz – aggieNick02 Apr 28 '21 at 19:14

2 Answers

16

Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).

(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including @BeeOnRope's testing of multiple store streams)
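
As a rough back-of-the-envelope sketch (the buffer counts and latencies below are illustrative assumptions, not measurements of the machines in the question), the bound is roughly outstanding_lines × 64 bytes per memory latency:

#include <cstdio>

// Back-of-the-envelope bound: bandwidth <= outstanding 64-byte lines / latency.
// The latency and buffer-count numbers are illustrative assumptions only.
int main()
{
    struct Config { const char *name; double outstandingLines; double latencyNs; };
    const Config configs[] = {
        { "client chip, assumed ~12 LFBs, ~60 ns to DRAM",    12.0, 60.0 },
        { "many-core chip, assumed ~10 LFBs, ~90 ns to DRAM", 10.0, 90.0 },
    };
    for (const Config &c : configs)
    {
        double bytesPerSec = c.outstandingLines * 64.0 / (c.latencyNs * 1e-9);
        std::printf("%s: ~%.1f GB/s per thread\n", c.name, bytesPerSec / 1e9);
    }
}

With those assumed numbers the bound works out to roughly 12.8 GB/s vs. 7.1 GB/s, i.e. latency alone can open a large single-thread gap even with identical DRAM.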


Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).

SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)

(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)


A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
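
For illustration only (ParallelCopy and the chunking scheme here are a sketch, not code from the question): splitting one big copy across a few threads lets each core keep its own line-fill buffers busy, which is how the aggregate-bandwidth numbers on quad-channel parts are usually reached.

#include <cstring>
#include <thread>
#include <vector>

// Sketch: split one large copy into per-thread chunks so several cores
// each have their own set of outstanding cache-line requests in flight.
void ParallelCopy(void *dst, const void *src, size_t size, unsigned threadCount)
{
    std::vector<std::thread> workers;
    size_t chunk = size / threadCount;
    for (unsigned t = 0; t < threadCount; ++t)
    {
        size_t offset = t * chunk;
        size_t bytes  = (t == threadCount - 1) ? size - offset : chunk;
        workers.emplace_back([=] {
            std::memcpy(static_cast<char *>(dst) + offset,
                        static_cast<const char *>(src) + offset, bytes);
        });
    }
    for (std::thread &w : workers)
        w.join();
}

A couple of threads are usually enough to approach the DRAM limit on a client chip; beyond that the memory controllers, not the cores, become the bottleneck.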

For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
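
To make the NT-store idea concrete (an illustrative sketch, not the code MSVC's memcpy emits; it assumes both buffers are 16-byte aligned and the size is a multiple of 16):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Sketch: copy with non-temporal stores, which bypass the cache and avoid
// the read-for-ownership traffic that ordinary stores to cold lines incur.
void CopyNT(void *dst, const void *src, std::size_t size)
{
    const __m128i *s = static_cast<const __m128i *>(src);
    __m128i *d = static_cast<__m128i *>(dst);
    for (std::size_t i = 0; i < size / 16; ++i)
    {
        __m128i v = _mm_load_si128(s + i);   // normal (cached) 16-byte load
        _mm_stream_si128(d + i, v);          // non-temporal 16-byte store
    }
    _mm_sfence();  // NT stores are weakly ordered; fence before anyone reads dst
}

NT stores help most when the destination won't be re-read soon, which is typically the case for a 25 MB memcpy like the one in the question.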

Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).

Peter Cordes
  • Maybe I am misunderstanding something but your answer seems to contradict what OP is observing? – Stephan Dollberg May 25 '19 at 19:45
  • @inf: how so? TL:DR Higher latency => lower bandwidth (per thread or for a single thread alone) on many-core chips. – Peter Cordes May 25 '19 at 19:47
  • 1
    Yeah, but isn't OP saying that he sees higher bandwidth / lower latency on Skylake? – Stephan Dollberg May 25 '19 at 19:53
  • 1
    @inf: exactly. And they have a quad-core Skylake-client chip, but a many-core Broadwell-E. SKL still uses a simple fast ring-bus; it's only SKX that moved to a slower but more scalable mesh network. – Peter Cordes May 25 '19 at 19:54
  • Right ok, I assumed he was talking about a Skylake-server one. In that case, the Skylake-server paragraph should be irrelevant to the behaviour observed by OP, and it's more likely to be caused by the higher ring-bus hop count on the Broadwell + more LFBs? – Stephan Dollberg May 25 '19 at 20:34
  • @inf: yes, exactly. I didn't limit this answer to *only* the exact hardware the question was asking about, because it's a more general effect. (And BTW, we can tell the SKL is SKL-client because it has dual-channel memory). With HW prefetching, the number of LFBs might not be relevant, the L2 streamer is more likely limited by superqueue entries. IDK if SKL has more than BDW there. – Peter Cordes May 25 '19 at 20:40
  • 1
    @inf: anyway, thanks for the feedback, I hadn't realized the possible confusion. Edited to clarify. – Peter Cordes May 25 '19 at 23:17
  • So his Broadwell chip has the slower mesh design? – user997112 Feb 01 '20 at 22:16
  • @user997112: no, the mesh was new with Skylake-server, like I said in my answer. SnB-family before that uses a ring bus. – Peter Cordes Feb 01 '20 at 22:36
  • @PeterCordes I totally misunderstood. So whilst his Broadwell-E uses the same ring connect as the Skylake (Desktop) the fact it has many many cores illustrates the performance degradation with the ring connect design. So, to what extent do you think the performance degradation was caused by the ring design, compared with the fewer (10) LFBs? Just trying to understand which was the bigger issue. – user997112 Feb 02 '20 at 06:12
  • 1
    @user997112: Xeon CPUs separate the core vs. uncore frequencies so they need async buffering, adding even more latency beyond just extra ring hops. Quad core "client" chips have all cores (and the uncore) locked to the same frequency; they can't scale independently. This is (I think) part of what keeps uncore latency significantly lower. SKX's mesh has even more latency apparently, or for some reason even worse single-core bandwidth, but any Xeon even if its ring bus isn't huge is a different beast from client chips. (Except the quad-core workstation Xeons based on client silicon.) – Peter Cordes Feb 02 '20 at 07:21
  • @PeterCordes Thanks! Is this all from the developer manuals? I really ought to read them front-to-back! If not, could you recommend more resources? – user997112 Feb 03 '20 at 01:41
  • @user997112: I'm not sure how much of that is explicitly there in the optimization manual. Putting the pieces together to guess / explain it that way (lower latency because of not crossing frequency domains) requires *understanding* CPU architecture in general, and knowing a bunch of random facts about Intel CPUs. See various SO answers and comment-discussion from @ BeeOnRope, @ HadiBrais, myself, and occasionally other people. Some of this is linked from https://stackoverflow.com/tags/x86/info – Peter Cordes Feb 03 '20 at 02:22
  • 1
    I thought client SKL chips have a separate uncore clock. More ring stops seems not enough to explain the disparities in memory latency, unless may you need to incur the full trip several times? IIRC it's one uncore cycle per stop. Server chips (even before SKX) also have *much* NT store so I guess there is some significant design difference. Perhaps the prefetchers are all tuned differently. – BeeOnRope Jun 21 '20 at 00:41
  • @BeeOnRope: Good point. Maybe there is some difference in crossing clock domains from cores to uncore? Like some kind of shortcut that client chips can take even if cores and uncore are in separate clock domains. Do you know if these benchmark numbers apply to single-socket many-core Xeons? If not, it could all be due to snooping the other socket(s). – Peter Cordes Jun 21 '20 at 00:54
  • 1
    @Peter, yeah snooping other sockets definitely slows things down if there isn't a memory directory (per John McCalpin). I am not totally sure but I thought the slowdown applied also to single-socket configurations. My prior message was missing some words. It should say "much lower NT store throughput". In SKX the mystery would seem less mysterious since there are many more differences in the L3 and uncore design, but the difference existed even for older chips. – BeeOnRope Jun 21 '20 at 14:11
3

I finally got VTune (evaluation) up and running. It gives a DRAM-bound score of 0.602 (between 0 and 1) on Broadwell-E and 0.324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (just dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.

It makes buying into the Broadwell-E architecture a much tougher call: you'd have to really need the extra cores to even consider it.

I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
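
If anyone wants to gauge how much the TLB misses matter, one experiment (just a sketch, I haven't verified it changes these numbers) is to back the buffers with 2 MiB large pages, which on Windows needs the "Lock pages in memory" privilege:

#include <Windows.h>
#include <cstdio>

// Sketch: allocate a buffer backed by large pages to reduce TLB pressure.
// Requires the "Lock pages in memory" privilege; falls back to normal pages.
void *AllocLargePages(size_t size)
{
    size_t largePage = GetLargePageMinimum();  // typically 2 MiB
    if (largePage != 0)
    {
        size_t rounded = (size + largePage - 1) & ~(largePage - 1);
        void *p = VirtualAlloc(nullptr, rounded,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p != nullptr)
            return p;
        std::printf("Large-page alloc failed (error %lu), using normal pages\n", GetLastError());
    }
    return VirtualAlloc(nullptr, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}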

I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.

aggieNick02
  • Even between chips of the same uarch, like Haswell vs Haswell, the disparity in latency between client and server chips has always existed. Skylake didn't make a big jump down in memory latency, either: no such magic in the IMC. – BeeOnRope Jun 21 '20 at 00:43
  • 2
    @BeeOnRope The differences in throughput between "client" and "server" can be *partially* explained by the lower (pointer-chasing) memory load latency of the client systems. For the memcpy operation, "large" copies should be using streaming stores. In most generations of Intel server processors, the *occupancy* for streaming stores is higher than on the equivalent client processor. This is similar to load latency, but is more strongly bound by coherence. In SKX, for example, memory directories reduce load latency, but not streaming store occupancy. – John D McCalpin Jun 22 '20 at 15:10
  • Thanks Dr @McCalpin. One thing I'm not clear about: is the load-load latency and/or NT store occupancy so much worse on server chips on _single socket systems_? If yes (and my recollection is yes), why? – BeeOnRope Jun 22 '20 at 15:19