3

If I have this class:

class MyClass {
public:         // members must be public for the loop below to read them
    short a;
    short b;
    short c;
};

and I have this code performing calculations on the above:

std::vector<MyClass> vec;
// ... vec populated elsewhere
int sum = 0;
for (const auto& x : vec) {
    sum += x.a * (3 + x.b) / x.c;  // x, not vec: vec has no members a, b, c
}

I understand the CPU loads only the data it needs from the L1 cache, but when the L1 cache fetches from the L2 cache it loads a whole "cache line" (which may include a few bytes of data it doesn't need).

How much data does the L2 cache load from the L3 cache, and the L3 cache load from main memory? Is it defined in terms of pages and if so, how would this answer differ according to different L2/L3 cache sizes?

user997112
  • Related question [Line size of L1 and L2 caches](http://stackoverflow.com/questions/14707803/line-size-of-l1-and-l2-caches) –  Apr 12 '14 at 12:35
  • By the way, if you did not have the division, I would suggest using a structure-of-arrays rather than an array-of-structures organization, which allows convenient use of SIMD instructions. Unfortunately, most ISAs do not include SIMD division, at most providing a (parallel) single-precision FP reciprocal estimate instruction which can be used with Newton-Raphson to perform division, so SIMD operations might not be helpful here. –  Apr 15 '14 at 23:15

2 Answers

7

L2 and L3 caches also have cache lines, and these are smaller than a virtual memory page. The L2 and L3 cache line size is greater than or equal to the L1 cache line size, not uncommonly twice it.

For recent x86 processors, all caches use the same 64-byte cache line size. (Early Pentium 4 processors had 64-byte L1 cache lines and 128-byte L2 cache lines.)

IBM's POWER7 uses 128-byte cache blocks in L1, L2, and L3. (However, POWER4 used 128-byte blocks in L1 and L2, but sectored 512-byte blocks in the off-chip L3. Sectored blocks provide a valid bit for subblocks. For L2 and L3 caches, sectoring allows a single coherence size to be used throughout the system.)

Using a larger cache line size in the last-level cache reduces tag overhead and facilitates long burst accesses between the processor and main memory (longer bursts can provide more bandwidth and facilitate more extensive error correction and DRAM chip redundancy), while allowing the other cache levels and the coherence protocol to use smaller chunks, which reduces bandwidth use and capacity waste. (Large last-level cache blocks also provide a prefetching effect whose cache-polluting issues are less severe because of the relatively high capacity of last-level caches. However, hardware prefetching can accomplish the same effect with less waste of cache capacity.)

With a smaller cache (e.g., a typical L1 cache), evictions happen more frequently, so the time span in which spatial locality can be exploited is smaller (i.e., it is more likely that only data in one smaller chunk will be used before the cache line is evicted). A larger cache line also reduces the number of blocks available, in some sense reducing the capacity of the cache; this capacity reduction is particularly problematic for a small cache.

  • 1
    Larger line-size for outer caches isn't used in any modern x86 CPUs. It's an interesting idea to think about, but AFAIK it's not very relevant for optimizing modern code. (Unless some ARM or ARM64 chips use it.) – Peter Cordes Mar 03 '22 at 14:21
  • @PeterCordes For caches with tag and data on the same chip, this will probably be the case (though using dense DRAM for data and SRAM for tags might favor larger cache lines). An off-chip DRAM-based cache with on-chip tags (or partial tags) could sufficiently favor a larger cache line. Cache compression and indirection (like some NUCA proposals) could favor larger lines in LLC; if a sectored cache does not always load all subblocks and invalid ones use no storage (indirection), is it different from aligned-adjacent prefetch? Yes, such is more interesting than practically useful. –  Mar 03 '22 at 20:05
4

It depends somewhat on the ISA and microarchitecture of your platform. Recent x86-64 microarchitectures use 64-byte lines at every level of the cache hierarchy.

Typically a signed short requires two bytes, meaning that MyClass needs 6 bytes (there is no per-object overhead for a class like this with no virtual functions). Since vector<> stores its elements contiguously like an array, you get about 10 MyClass objects per 64-byte line. Provided the vector<> is the right length, you won't load much garbage.

It's worth noting that since you're accessing the elements in a very predictable pattern, the hardware prefetcher should kick in and fetch ahead of the data it expects you to use. This could bring more than you need into the various levels of the cache hierarchy; the exact behavior varies from chip to chip.

hayesti