
I am asking this question about the Haswell microarchitecture (Intel Xeon E5-2640 v3 CPU). From the CPU specifications and other resources I found that there are 10 line fill buffers (LFBs) and that the super queue (SQ) has 16 entries. I have two questions related to the LFBs and the super queue:

1) What is the maximum degree of memory-level parallelism the system can provide: 10 (LFBs) or 16 (SQ entries)?

2) Some sources say that every L1D miss is recorded in the SQ, which then allocates a line fill buffer, while other sources say that the SQ and LFBs can work independently. Could you please briefly explain how the SQ works?

Here is an example figure (not for Haswell) showing the SQ and LFBs.

References: https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

http://www.realworldtech.com/haswell-cpu/

A-B
  • (off-topic) You're looking at an old copy of Intel's optimization manual. The official version is at https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf. (Currently dated June 2016, so it's newer than the Sep 2015 version you linked). – Peter Cordes Aug 20 '17 at 17:08
  • That block diagram isn't Haswell (no port6 or port7). I think it's Nehalem, based on the 36-entry RS size (vs. [54 in Sandybridge](http://www.realworldtech.com/haswell-cpu/3/)), and that it shows write-back to an "IA register set". (Sandybridge-family uses a physical register file). The cache hierarchy in Haswell is not fundamentally different from Nehalem, though: still 10 LFBs for outstanding L1d requests. I've never read much about the interface between L2 and L3. Now that you mention it, searching for "super queue" in Intel's optimization manual does turn up some stuff. Cool. – Peter Cordes Aug 20 '17 at 17:49
  • Also, that Haswell PDF is just a copy of David Kanter's http://www.realworldtech.com/haswell-cpu/ Haswell deep dive. Why link to a pdf copy? – Peter Cordes Aug 20 '17 at 17:51
    Anyway, Haswell's 10 LFBs limit L1d concurrency from a single core, but I'm guessing that the superqueue allows prefetch from L3 into L2 (and L2 evictions) to happen independently of the LFBs. The diagram also shows that instruction-cache misses will be serviced through the superqueue, but won't use LFBs. (Because L1I is separate from L1D, and the LFBs are for the D-cache.) – Peter Cordes Aug 20 '17 at 17:54
  • @PeterCordes Thanks for your suggestions. I have updated the links. I could not find any other diagram showing the superqueue, so I had to use the example diagram I could find. Prefetching adds another level of complexity to the question. In my understanding, HW prefetchers don't use LFBs and SW prefetchers do (I am not very sure about this statement). – A-B Aug 21 '17 at 04:36
  • Most prefetchers in Intel CPUs fetch into L2, so they don't use LFBs. Prefetch into L1D does use LFBs, I think. – Peter Cordes Aug 21 '17 at 05:42

1 Answer


For (1), logically the maximum parallelism would be limited by the least-parallel part of the pipeline, which is the 10 LFBs, and this is probably strictly true for demand-load parallelism when prefetching is disabled or can't help. In practice everything is more complicated once your load is at least partly helped by prefetching, since then the wider queues between L2 and RAM can be used, which could make the observed parallelism greater than 10. The most practical approach is probably direct measurement: given the measured latency to RAM and the observed throughput, you can calculate an effective parallelism for any particular load (by Little's law, the number of lines in flight equals throughput times latency).

For (2), my understanding is that it is the other way around: all demand misses in the L1 first allocate an LFB (unless, of course, they hit an existing LFB) and may involve the "superqueue" later (or whatever it is called these days) if they also miss further out in the cache hierarchy. The diagram you included seems to confirm that: the only path from the L1 is through the LFB queue.

BeeOnRope
  • Thanks for your reply. I was thinking the same about the allocation of LFBs for L1D misses. I am pasting some lines copied from the Intel Optimization Manual: "The L1D miss creates an entry in the 16-element superqueue and allocates a line fill buffer. If the line is found in the L2 cache, it is transferred to the L1 data cache and the data access instruction can be serviced. The load latency from the L2 cache is 10 cycles, resulting in a performance penalty of around 6 cycles, the difference of the effective L2 cache and L1D latencies." – A-B Aug 21 '17 at 04:31