
As is known, all levels of cache (L1/L2/L3) on modern x86_64 are virtually indexed, physically tagged. And all cores communicate via the Last Level Cache (cache L3) using a cache-coherency protocol (MOESI/MESIF) over QPI/HyperTransport.

For example, a Sandy Bridge family CPU has a 4-16 way L3 cache and a 4 KB page size; this allows data to be exchanged between concurrent processes executed on different cores via shared memory. This is possible because cache L3 can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time.

Does this mean that every time process 1 requests the same shared memory region, process 2 has to flush its cache lines for that page to RAM, and process 1 then loads the same memory region as cache lines of a page in process 1's virtual space? That would be really slow, or does the processor use some optimizations?

Does a modern x86_64 CPU use the same cache lines, without any flushes, to communicate between two processes with different virtual address spaces via shared memory?
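
For concreteness, here is a minimal Linux sketch of the setup being asked about: two processes with separate virtual address spaces write and read the same physical page through a shared mapping, with no explicit flushing done by the program itself.

```c
/* Minimal sketch (Linux): parent and child, each with its own virtual
 * address space, communicate through the same physical page.  No cache
 * flushing is done in software; hardware coherency keeps the views
 * consistent. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* One page shared between parent and child. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: writer */
        strcpy(shared, "hello from the child");
        return 0;
    }
    waitpid(pid, NULL, 0);                /* parent: reader */
    printf("parent read: %s\n", shared);  /* sees the child's write */
    munmap(shared, 4096);
    return 0;
}
```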

Sandy Bridge Intel CPU - cache L3:

  • 8 MB - cache size
  • 64 B - cache line size
  • 128 K - lines (128 K = 8 MB / 64 B)
  • 16-way
  • 8 K - number of sets (8 K = 128 K lines / 16 ways)
  • 13 bits [18:6] of the virtual address - the index, which selects the current set
  • 512 KB - addresses that are equal modulo 512 KB (8 MB / 16-way = way size) compete for the same 16-way set
  • low 19 bits - significant for determining the current set number

  • 4 KB - standard page size

  • only the low 12 bits - the same in the virtual and the physical address

We have 7 missing bits [18:12] - i.e. we would need to check (2^7 * 16 ways) = 2048 cache lines. This is the same as a 2048-way cache - so this would be very slow. Does this mean that cache L3 is (physically indexed, physically tagged)?

Summary of index bits missing from the virtual address (page size 4 KB - 12 bits):

  • L3 (8 MB = 64 B x 128 K lines), 16-way, 8 K sets, 13 index bits [18:6] - missing 7 bits
  • L2 (256 KB = 64 B x 4 K lines), 8-way, 512 sets, 9 index bits [14:6] - missing 3 bits
  • L1 (32 KB = 64 B x 512 lines), 8-way, 64 sets, 6 index bits [11:6] - no missing bits
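
A small sketch of the same arithmetic, using the cache geometries listed above (illustrative only, not how the hardware computes anything):

```c
/* Index-bit arithmetic for the geometries in the question:
 * line = 64 B, page = 4 KB. */
#include <stdio.h>

static void level(const char *name, unsigned size_bytes, unsigned ways)
{
    unsigned line = 64, line_bits = 6, page_offset_bits = 12;
    unsigned sets = size_bytes / line / ways;
    unsigned index_bits = 0;
    for (unsigned s = sets; s > 1; s >>= 1) index_bits++;    /* log2(sets) */
    unsigned top_index_bit = line_bits + index_bits - 1;      /* e.g. bit 18 for L3 */
    unsigned missing = (top_index_bit >= page_offset_bits)
                     ? top_index_bit - page_offset_bits + 1 : 0;
    printf("%s: %u sets, index bits [%u:%u], %u bit(s) above the page offset "
           "-> %u candidate lines if only the page offset is known\n",
           name, sets, top_index_bit, line_bits, missing,
           (1u << missing) * ways);
}

int main(void)
{
    level("L1", 32u << 10, 8);    /* 64 sets,  bits [11:6], 0 missing ->    8 lines */
    level("L2", 256u << 10, 8);   /* 512 sets, bits [14:6], 3 missing ->   64 lines */
    level("L3", 8u << 20, 16);    /* 8K sets,  bits [18:6], 7 missing -> 2048 lines */
    return 0;
}
```

It prints 0 missing bits for L1, 3 for L2 and 7 for L3, i.e. 2048 candidate lines for L3 when only the 12-bit page offset is known.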

It should be:

  • L3 / L2 (physically indexed, physically tagged) used after TLB lookup
  • L1 (virtually indexed, physically tagged)
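
A toy model of that ordering, where translate() is a stand-in for the TLB and the addresses are invented; it only shows which address bits each lookup can use:

```c
/* Toy model (not real hardware): the L1 set index comes from
 * page-offset bits, so it can be computed in parallel with the TLB
 * lookup; the L2/L3 set index and all tag compares use the physical
 * address produced by the TLB. */
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the TLB: here just a fixed, made-up page-frame number. */
static uint64_t translate(uint64_t vaddr)
{
    const uint64_t fake_frame = 0x3c0de;          /* hypothetical PFN */
    return (fake_frame << 12) | (vaddr & 0xfff);  /* keep page offset */
}

int main(void)
{
    uint64_t vaddr = 0x7f12345678d4;

    /* L1 (VIPT): index from bits [11:6] of the *virtual* address,
     * available before translation completes. */
    uint64_t l1_set = (vaddr >> 6) & 0x3f;

    uint64_t paddr = translate(vaddr);

    /* L1 tag, L2/L3 index and tag: all from the *physical* address. */
    uint64_t l1_tag = paddr >> 12;
    uint64_t l2_set = (paddr >> 6) & 0x1ff;   /* bits [14:6], 512 sets */
    uint64_t l3_set = (paddr >> 6) & 0x1fff;  /* bits [18:6], 8K sets  */

    printf("L1 set %llu (virtual index), tag %#llx (physical)\n",
           (unsigned long long)l1_set, (unsigned long long)l1_tag);
    printf("L2 set %llu, L3 set %llu (both physical)\n",
           (unsigned long long)l2_set, (unsigned long long)l3_set);
    return 0;
}
```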


Alex
  • re: your edit. No, L3 is absolutely not virtually tagged. It doesn't get flushed on transitions, and it **does** backstop coherency traffic. The only question is exactly *how* it does that. Like I said in my answer, my best guess is that it's physically indexed as well as physically tagged. That would make sense for multiple reasons, including this: only the L1 cache needs to care about both virtual and physical addresses. When querying higher-level caches, only the physical address needs to be sent to them at all. – Peter Cordes Nov 29 '15 at 14:19
  • David Kanter's writeup doesn't say anything about L2/L3 being virtually indexed, either. You should edit that mis-statement out of your question (and the one I pointed out at the start of my answer). There's no advantage to virtual indexing if you have the physical address available, so phys/phys makes a ton of sense. – Peter Cordes Nov 29 '15 at 14:23
  • @Peter Cordes Ok, I fixed it. – Alex Nov 29 '15 at 14:29
  • @Peter Cordes I.e. we have 3 steps: **1.** The core sends a query to L1 and TLB-L1 simultaneously, and receives answers from both at the same time. **2.** From L1 we get the data of the required cache line if L1 contains it, and from TLB-L1 (100 entries * 4 KB page size = 400 KB) we get the physical address of this cache line if it is in L2/L3. **3.** If L2 or L3 doesn't contain it, then we send a query to TLB-L2, don't we? – Alex Nov 29 '15 at 15:02
  • No, 2nd-level TLB isn't the TLB for the L2 cache. It's a 2nd-level for the TLB. If L1TLB misses, L1D$ can't even check its tags until either L2TLB hits, or a full TLB miss happens and the CPU walks the page table. (And evicts an old TLB entry, replacing it with the newly-found one.) – Peter Cordes Nov 29 '15 at 15:12
  • @Peter Cordes I don't think that TLB-L2 is for the L2 cache :) But I was wrong when I thought that TLB-L1 (100 entries for 4 KB pages) completely covers the L2 cache (256 KB): L2 can hold 4096 cache lines from 4096 different pages, which would require 4096 entries, more than even TLB-L2 (512 entries) provides. TLB-L1 could completely cover the L2 cache only if cache lines were loaded as a contiguous sequence, which would require only 64 entries. – Alex Nov 29 '15 at 15:47
  • There's no direct interaction between which lines are hot in L2 and which translations are hot in the TLB. You can do a TLB flush without flushing L2, e.g. after a task switch or an mmap/munmap. And as you say, when caching only a few lines per page, there are nowhere near enough TLB entries. I hadn't ever thought about a connection between number of TLB entries and amount of contiguous memory that can be cached. I mean, if you have a lot of contig memory, you can always just use a hugepage. – Peter Cordes Nov 29 '15 at 16:37

1 Answer


This is possible because cache L3 can't contain the same physical memory area as a page of process 1 and as a page of process 2 at the same time.

Huh what? If both processes have a page mapped, they can both hit in the cache for the same line of physical memory.

That's part of the benefit of Intel's multicore designs using large inclusive L3 caches. Coherency only requires checking L3 tags to find cache lines in E or M state in another core's L2 or L1 cache.

Getting data between two cores only requires writeback to L3. I forget where this is documented. Maybe http://agner.org/optimize/ or What Every Programmer Should Know About Memory?. Or for cores that don't share any level of cache, you need a transfer between different caches at the same level of the cache hierarchy, as part of the coherency protocol. This is possible even if the line is "dirty", with the new owner assuming responsibility for eventually writing-back the contents that don't match DRAM.


The same cache line mapped to different virtual addresses will always go in the same set of the L1 cache. See discussion in comments: L2 / L3 caches are physically indexed as well as physically tagged, so aliasing is never a problem. (Only L1 could get a speed benefit from virtual indexing. L1 cache misses aren't detected until after address translation is finished, so the physical address is ready in time to probe higher level caches.)
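
A quick Linux-only sketch of that point (it assumes memfd_create is available): the same physical page mapped at two different virtual addresses always gives the same bits [11:6], hence the same L1 set.

```c
/* Linux sketch: map the same physical memory at two different virtual
 * addresses and show that any byte inside the page has identical low
 * 12 bits in both mappings, hence the identical L1 set index [11:6]. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Anonymous shared memory object, mapped twice in this process. */
    int fd = memfd_create("alias-demo", 0);
    if (fd < 0 || ftruncate(fd, 4096) != 0) return 1;

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) return 1;

    size_t off = 0x2c4;                  /* arbitrary offset inside the page */
    uintptr_t va1 = (uintptr_t)(a + off), va2 = (uintptr_t)(b + off);

    printf("virtual addresses differ: %#lx vs %#lx\n",
           (unsigned long)va1, (unsigned long)va2);
    printf("L1 set index, bits [11:6]: %lu vs %lu\n",
           (unsigned long)((va1 >> 6) & 0x3f),
           (unsigned long)((va2 >> 6) & 0x3f));          /* always equal */

    a[off] = 42;                                          /* write via one mapping */
    printf("read through the other mapping: %d\n", b[off]); /* sees the write */
    return 0;
}
```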

Also note that the discussion in comments incorrectly mentions Skylake lowering the associativity of L1 cache. In fact, it's the Skylake L2 cache that's less associative than before (4-way, down from 8-way in SnB/Haswell/Broadwell). L1 is still 32kiB 8-way as always: the maximum size for that associativity that keeps the page-selection address bits out of the index. So there's no mystery after all.
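
The size constraint as plain arithmetic, assuming 4 KiB pages and that all index bits must come from the page offset:

```c
/* A VIPT cache can grow to page_size * ways before any index bit
 * falls outside the 4 KiB page offset. */
#include <stdio.h>

int main(void)
{
    unsigned page = 4096;                                       /* 4 KiB pages */
    printf("8-way max VIPT size: %u KiB\n", page * 8 / 1024);   /* 32 KiB */
    printf("4-way max VIPT size: %u KiB\n", page * 4 / 1024);   /* 16 KiB */
    return 0;
}
```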

Also see another answer to this question about HT threads on the same core communicating through L1. I said more about cache ways and sets there.

Peter Cordes
  • Could you expand the math for your argument? How many bits you need for the index depends on the cache size, cache line length and the associativity of the cache. So for a 8MB 16 way cache with 64 B per cacheline we should need log2(2^23/(2^4*2^6))=13 index bits, but a page only covers 12 bits. – Voo Nov 29 '15 at 10:20
  • "The number of index bits doesn't increase with cache size" - how that? For simplicity let's go with a direct-mapped cache. If we have 8 cache buckets, that means we have to index with the lower 3 bits. If we had 16, we would need the lower 4 bits and so on. Even in a set-associative cache the size of sets is fixed, which means the larger your cache size the more sets you get, which again means you have to use more bits to decide which set you want. – Voo Nov 29 '15 at 11:04
  • @Peter Cordes Thank you. But the **64 bits of the virtual address** (of which the low 12 bits are the same as in the physical one) are enough to check whether we need to flush L3 to RAM or not, but **not enough to check that this is the same physical address without using the TLB**. I.e. if one physical 4 KB page is mapped to two different virtual addresses, then to decide whether to flush/evict the cache line or to reuse it, we definitely need the TLB when accessing the cache. Does the processor do this? – Alex Nov 29 '15 at 11:51
  • @Alex: yes, like I said in the linked answer, Intel's CPUs use physically-tagged caches. Tag checks are always done with physical addresses. If you want to think of it this way, cache-coherency happens after the TLB is done translating. Of course you need to use the TLB when accessing the cache. [These slides](http://www.dauniv.ac.in/downloads/CArch_PPTs/CompArchCh10L09CachesandVirtualMemory.pdf) do a decent job of explaining things, I think. – Peter Cordes Nov 29 '15 at 12:04
  • @Voo: Oh, I see where I went wrong. I had it backwards. The way I was calculating, the number of lines in each way was fixed, rather than the number of ways! So larger caches would be more associative. I'm still sure that CPUs solve this problem somehow (and can't have the same physical line in the cache twice, in different ways for different virtual addresses), but now I'm not sure how. Any ideas? – Peter Cordes Nov 29 '15 at 12:13
  • @Peter Well you can handle it with additional logic in the CPU - AMD did that with the Opterons iirc - but considering that I'm off by just a single bit I might just be overlooking something (don't know what though). – Voo Nov 29 '15 at 12:25
  • @Voo: Xeons can have 45MB of L3. It's not just a single bit. – Peter Cordes Nov 29 '15 at 12:37
  • @Voo Can this solve the problem? "The L3 cache for Sandy Bridge is scalable and implementation specific. The write-back L3 cache for high-end client models is 8MB and **each 2MB slice is 16-way associative.**" http://www.realworldtech.com/sandy-bridge/8/ I.e. 4 slices. Does it mean that the L3 cache for Sandy Bridge is 64-way associative (4 slices * 16-way)? But you said that AMD Opterons solve this problem with additional logic (IIRC); does that mean they have groups of ways larger than 4K (12 bits)? – Alex Nov 29 '15 at 12:41
  • @Voo and Alex: **Probably caches other than L1 use physical indexing** as well as physical tags. L1 only uses virtual indexing as a performance optimization, to fetch the right set of tags in parallel with the TLB generating the physical address. You only check L2 if you miss in L1, and by then you already have the physical address ready. This doesn't explain how Skylake's 4-way L1 cache works, though, or AMD's larger low-associativity L1 caches. **Maybe this is what multiple banks is about**? http://www.7-cpu.com/cpu/K8.html (64 KB. 64 B/line, 2-WAY. 8 banks.) – Peter Cordes Nov 29 '15 at 13:40
  • @Alex: see my prev comment. I should pull out my copy of Patterson and Hennessy and see what they have to say. (Always nice to find out my undergrad textbooks are highly-regarded references that I'm glad I kept. :) I'm not really that curious ATM, though. I know it works, because that's been well established by various discussions of performance from various sources, talking about how L3 is the backstop for coherency instead of main mem, thanks to the inclusive L3 cache. – Peter Cordes Nov 29 '15 at 13:47
  • @PaulA.Clayton: I found your http://superuser.com/questions/745008/whats-the-difference-between-physical-and-virtual-cache post interesting. Can you jump in on this discussion and shed some light on this, please and thanks? Do L2/L3 caches tend to be physically indexed? I think your SU answer gives some clues to how Skylake might have dropped L1 associativity to 4-way (to save power) while still keeping it 32kiB, even though this should require the 13th address bit to index it. Do they maybe fetch two sets of tags, and only decide which one to check based on the TLB output? – Peter Cordes Nov 29 '15 at 13:57
  • I do not know of any instance of a conventional architecture having virtually indexed L2/L3 cache. ([The Mill](http://millcomputing.com/wiki/Main_Page) is (will be?) unusual in having a single address space, allowing translation to be delayed.) I do not know how Skylake handles aliasing issues. With MESI, a modified line must be written back (not flushed) if another cache wants to read it. Providing a Forwarding or Owned state allows a dirty cache line to be shared. (I do not remember when Intel moved from MESI to MESIF.) BTW, notifications do not work across posts (or at least questions). – Nov 29 '15 at 22:15