4

According to the “Intel 64 and IA-32 Architectures Optimization Reference Manual,” April 2012, page 2-23:

The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks from the software point of view, this does not appear as a normal N-way cache.

My computer is a 2-core Sandy Bridge with a 3 MB, 12-way set-associative LLC. That does not seem consistent with Intel's documentation, though. According to the data, it seems I should have 24 ways. I can imagine there is something going on with the number of cores/cache slices, but I can't quite figure it out. If I have 2 cores and hence 2 cache slices of 1.5 MB each, I would have 12 ways per cache slice according to Intel, and that does not seem consistent with my CPU specs. Can someone clarify this for me?

If I wanted to evict an entire cache line, would I need to access the cache in strides of 128 KB or 256 KB? That is in fact what I am trying to achieve.

Any suggested readings are very welcome.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
alex10791
  • 434
  • 4
  • 11

2 Answers

5

Associativity is orthogonal to the number of slices and to the mapping done by the hash function. If a given address is mapped to some cache slice (and a given set within it), it can only compete over the ways with other lines that were mapped to the same place. Having 2 slices does not raise associativity; it only reduces contention (since lines are eventually distributed evenly over more sets).

Therefore you have 12 ways per slice, but the overall associativity per set is still 12 ways.

If you were to test your associativity by accessing different lines mapped to the same set, you would just have a harder time picking such lines (you'd need to know the hash function), but you're still going to get thrashing after 12 lines. However, if you were to ignore the hashing and assume lines are simply mapped by their set bits, it could appear as if you have higher associativity, simply because the lines would divide uniformly between the slices, so thrashing would take longer. This isn't real associativity, but it comes close for some practical purposes. It would only work if you're using a wide physical memory range, though, since the upper bits need to change for the hashing to have any impact.

Leeor
  • 19,260
  • 5
  • 56
  • 87
  • I'm using huge pages (2MB). So you suggest that if I am only interested in clearing a single cache line, I should use a stride of 128 KB to jump within the same cache slice? That in fact kind of matches my observations and makes a lot of sense now. – alex10791 May 11 '16 at 12:27
  • I didn't do the math, so maybe even 64k could work, but that sounds right. 2M pages would indeed save a lot of headache here. By the way - you might find this paper interesting - http://arxiv.org/pdf/1508.03767v1.pdf – Leeor May 11 '16 at 13:19
4

Having 2 slices doubles the number of sets, not the number of ways per set. The latter would require every slice to check its tags for a set, so bandwidth wouldn't scale with cores (where every core has a slice of L3).

In the actual design, the index determines a single stop on the ring bus that handles the request for any given line.


If I wanted to evict an entire cache line would I need to access the cache in strides of 128 KB or 256 KB? In fact this is what I am trying to achieve.

Neither, it's not that simple. Unlike the smaller / faster caches, the index for the last-level cache isn't a simple range of bits from the address. It's more like a hash function of all the address bits above the offset into the cache line, which reduces collisions when large strides happen by accident, or when multiple programs (or instances of the same program) on the same system use the same offset relative to a hugepage or whatever other boundary.

The last-level cache indexing function is one of Intel's secret ingredients; AFAIK it hasn't been reverse-engineered or published, but I haven't gone looking.

Obviously you can use a large buffer to have a very high chance of having evicted a line before you come back to it, but IDK if there's a good way otherwise. clflushopt has a cost similar to a store: it has to make sure no copy of the cache line still exists anywhere in the hierarchy.

prefetchnta prefetches into L1, and into L3 with a fast-eviction policy (using only a limited number of ways). In practice it can produce L3 misses even with a working set smaller than L3, without forced evictions: effectively just conflict misses.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    There has been some work on reversing the LLC hash algorithm for some Intel CPUs. OP: you can Google for papers on this RE - it usually comes up in the context of side-band attacks on code running on another core but sharing LLC. – BeeOnRope Dec 26 '17 at 19:53
  • 1
    I have worked on cache timing attacks as my master thesis and have successfully reversed some processor cache slice selection algorithms. The one was a Skylake processor. However I had the assumption that the replacement policy was LRU which is not the case reducing the success rate. I managed to leak AES keys from one process to another by manipulating the cache content, invoking the victim process and observing the new cache access times. The code for the attack is here https://github.com/alex10791/ctattack and for reversing is here https://github.com/alex10791/cssarev – alex10791 Dec 28 '17 at 11:56
  • 1
    @alex10791: Cool! Thanks for the links. For some reverse-engineering on Intel's adaptive LLC replacement policy (which has been a feature since IvyBridge), see http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/. – Peter Cordes Dec 28 '17 at 16:06
  • @PeterCordes Haha, I have been over that article like a million times trying to understand it back then :) I could have done so much better had I known what I know today, but thanks for it. I would really like to work further on it but there is always something more important to do. The code is really messy though. For the reversing you would need to use some boolean equation solving tool to deduce the equations; I used sage and solved in Galois field 2. I'm not 100% sure if there is sufficient code in my repository to do that. It's based on [this](https://eprint.iacr.org/2015/690.pdf) paper. – alex10791 Dec 28 '17 at 19:10
  • @PeterCordes I think that the cache slice selection algorithm has been there since Sandy Bridge actually. In any case I could find my thesis report if you are interested in more detail about what was done. – alex10791 Dec 28 '17 at 19:16
  • @alex10791: Thanks for the offer. I don't think I'd take the time to actually read much of it, though. If you want to post a link for future readers, a comment here or in your github would probably be cool. And BTW, yeah I've had the same experience of looking back on something that was confusing at the time and wishing I'd known then what I know now about CPU architecture, or some specific microarchitectural detail. – Peter Cordes Dec 29 '17 at 02:05
  • A quick clarification on your statement "The last-level cache indexing function is one of Intel's secret ingredients; AFAIK it hasn't been reverse-engineered or published, but I haven't gone looking.". You were referring to the mapping from physical address to LLC slice, right? Not the mapping from an address to a particular set in the cache (it seems that it is necessary to use the lower-order bits of the address to determine the set: log2(numSets) bits). – YetAnotherPseudoName Feb 18 '20 at 20:03
  • @YetAnotherPseudoName: No, I really do mean indexing individual sets. I might be wrong, but my understanding is that (like a software hash table) it computes a hash function based on some / all bits above the offset-within-line part of the physical address. See discussion in comments. L3 latency is high enough already that avoiding conflict misses from lots of 4k-aligned buffers is useful. Checking full tags (which can overlap with the hashed bits) makes this strategy work. – Peter Cordes Feb 19 '20 at 00:59
  • @PeterCordes That's interesting. I didn't realize that the tag bits at LLC would be more than the simple (addr_bits - set_index_bits - block_offset_bits). I guess more recent designs that change the addr-to-set mapping for security (e.g. CEASER http://memlab.ece.gatech.edu/papers/MICRO_2018_2.pdf) would also need to store the full tag bits. – YetAnotherPseudoName Feb 19 '20 at 06:12