
According to the comments in the accepted answer of this question, instructions and data of a program are loaded from memory into the cache in chunks. How does this affect cache mapping? For example, if there is one cache miss, a whole chunk of x bytes needs to be brought in, so it could end up overwriting other data/instructions in the cache that may still be needed. Overall, isn't this a worse design than working with just one instruction or piece of data at a time instead of chunks?

Is it necessary for multiple instructions to fit in one cache line because a cache line cannot be partially full, or, if it would be better for performance, can a single instruction occupy a cache line on its own?

Celeritas
    It's *possible* to write programs that result in poor use of the cache, where cache lines are constantly being evicted. However, most practical programs (algorithms) tend to exhibit [locality](http://en.wikipedia.org/wiki/Locality_of_reference#Types_of_locality). – Brett Hale Mar 17 '14 at 05:24
  • I don't really understand why you even accepted that answer. It doesn't really explain anything beyond what you already wrote yourself in the question. Any of the other answers are better than the accepted one. – Devolus Mar 17 '14 at 07:02
  • I think you need to google some more to understand how caches work. The whole idea, which is quite valid, is that if you read one location you are likely to read nearby locations. That is certainly true with code, which executes in linear chunks before branching, and fairly common with data. The "ways" of a cache allow multiple entries in the cache for the same address pattern, so some striping is possible. By definition, though, you can't hold everything in cache, so old stuff has to be evicted. – old_timer Mar 17 '14 at 20:27
  • Write your own or take an existing instruction set simulator, run some real-ish code on it, and examine the data and instruction fetch patterns. Sometimes in loops you run over the same code, but based on the loop length it may be in separate, non-conflicting cache lines. You would need several popular loops to have an issue with a lot of loss, and/or a lot of poorly placed jumps that cause a popular loop to be evicted. That is the exception, not the rule. – old_timer Mar 17 '14 at 20:29
  • Data, on the other hand, is often placed powers of two apart and can then end up striping... cache thrashing. – old_timer Mar 17 '14 at 20:29
  • Obviously caches make things faster otherwise they would not be so widely used. – old_timer Mar 17 '14 at 20:37
  • This isn't really a coding question anyway. It's about computer architecture. I'm voting to close as off-topic for this site. – Adi Inbar Mar 18 '14 at 03:55
  • The validity of this question and its closure are being discussed on [Meta](http://meta.stackexchange.com/questions/225889/is-this-question-really-too-broad-and-not-simply-off-topic). – George Stocker Mar 18 '14 at 17:12

1 Answer


There are valid reasons why cache blocks are relatively large.

The fraction of storage taken by tags decreases as the block size increases. Each block must have a tag indicating the address, the validity of the cache block, and other metadata (the data for implementing the replacement policy is spread across the blocks of a given set [index]). A 4 KiB, 2-way cache on a machine with a 24-bit physical address space would require, with 4-byte blocks, 1024 14-bit tags (13 address bits plus a valid bit; supporting LRU replacement would add a bit per pair of entries). With no parity or ECC, this would be over 40% overhead relative to the data storage. With 8-byte blocks the overhead is reduced to 512 14-bit tags (a bit over 20%). With 16-byte blocks, the overhead is halved again.
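
Those percentages are easy to check. The sketch below is mine, not part of the answer: it recomputes the tag-storage overhead for the hypothetical 4 KiB, 2-way cache with a 24-bit physical address, counting the address-tag bits plus one valid bit per block and ignoring LRU and parity/ECC state, as the text does.

```c
/* Minimal sketch (mine, not from the answer): tag-storage overhead for a
 * 4 KiB, 2-way cache with a 24-bit physical address at several block sizes.
 * Each tag is the address-tag bits plus one valid bit; LRU and ECC/parity
 * state are ignored, as in the text. */
#include <stdio.h>

int main(void)
{
    const unsigned cache_bytes  = 4 * 1024;  /* total data capacity */
    const unsigned ways         = 2;
    const unsigned address_bits = 24;

    for (unsigned block_bytes = 4; block_bytes <= 64; block_bytes *= 2) {
        unsigned blocks      = cache_bytes / block_bytes;
        unsigned sets        = blocks / ways;
        unsigned offset_bits = 0, index_bits = 0;
        while ((1u << offset_bits) < block_bytes) offset_bits++;
        while ((1u << index_bits) < sets)         index_bits++;

        unsigned tag_bits  = address_bits - index_bits - offset_bits + 1; /* +1 for valid */
        unsigned tag_store = blocks * tag_bits;            /* bits of tag storage */
        double   overhead  = 100.0 * tag_store / (cache_bytes * 8.0);

        printf("%2u-byte blocks: %4u tags x %2u bits = %5u bits (%.1f%% of data)\n",
               block_bytes, blocks, tag_bits, tag_store, overhead);
    }
    return 0;
}
```

With these parameters it prints roughly 43.8%, 21.9%, 10.9%, 5.5%, and 2.7% overhead for 4- through 64-byte blocks, matching the figures above.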

Using larger blocks also reduces the number of requests needed to move a given amount of memory, and each request has some overhead for communication and handling. In designs where address and data share bandwidth, the bandwidth overhead would be similar to that for tagging. There would also be overhead in buffering requests. With snooping cache coherence (setting aside old shared-bus systems, and optimizations such as snoop filtering), every miss must send a probe request to every other coherent cache.
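
To put rough numbers on the request-count argument, here is a tiny sketch of mine (the buffer size and the three peer caches are purely illustrative, not from the answer):

```c
/* Minimal sketch (illustrative numbers): how block size changes the number
 * of miss requests needed to stream through a buffer and, with snooping
 * coherence, the number of probe messages if each miss probes three peers. */
#include <stdio.h>

int main(void)
{
    const unsigned buffer_bytes = 64 * 1024; /* data streamed through      */
    const unsigned peer_caches  = 3;         /* caches probed on each miss */

    for (unsigned block = 4; block <= 64; block *= 2) {
        unsigned requests = buffer_bytes / block;
        printf("%2u-byte blocks: %5u miss requests, %6u snoop probes\n",
               block, requests, requests * peer_caches);
    }
    return 0;
}
```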

Using larger blocks also provides a simple form of prefetching. When the next level of the memory hierarchy has several times higher latency, prefetching becomes relatively attractive even if it sometimes evicts more useful data. Even for non-stack data there is typically significant spatial locality, so the probability that a nearby location will be accessed soon is higher than the probability of reuse for a location that may not have been touched for more than 500 accesses (the kind of block that would be evicted from a 2-way cache with 512 sets and LRU replacement). In addition, software can lay out data to exploit this and get "free" prefetching.
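
As a concrete illustration of that "free" prefetching, here is a sketch of mine (the 64-byte line size and 4-byte element size are assumptions, not specified in the answer):

```c
/* Minimal sketch (assumed parameters): count the cache-line fetches for a
 * sequential walk over an array.  With 64-byte lines and 4-byte elements,
 * one miss pulls in 16 elements, so the next 15 accesses are "free"
 * prefetches supplied by the large block. */
#include <stdio.h>

int main(void)
{
    const unsigned line_bytes    = 64;   /* assumed cache-line size */
    const unsigned element_bytes = 4;    /* e.g. a 32-bit int       */
    const unsigned n_elements    = 1000;

    unsigned total_bytes  = n_elements * element_bytes;
    unsigned line_fetches = (total_bytes + line_bytes - 1) / line_bytes;

    printf("%u sequential accesses -> %u line fetches (about 1 miss per %u accesses)\n",
           n_elements, line_fetches, line_bytes / element_bytes);
    return 0;
}
```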

Larger cache blocks are also a better fit to DRAM characteristics. Since the days of fast page mode DRAM, accessing multiple locations in the same DRAM "page" has been significantly more bandwidth-efficient than scattered accesses. DDR3 requires bursts of four sequential locations (so a 64-bit wide interface must fetch 32 bytes), and even that uses burst chop (which impacts DRAM chip bandwidth). By using larger cache blocks, all accesses become sequential bursts. (One could narrow the memory interface, but a 16-bit DDR3 interface would require two DRAM cycles to transfer 64 bits. Alternatively, one could use a prefetch buffer to store a longer burst; some memory controllers can prefetch nearby blocks to exploit DRAM page locality, but that would add some storage overhead and complexity.)
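
A rough way to see the burst argument in numbers, as a sketch of mine: the 64-bit bus and the 4-beat chopped burst follow the DDR3 description above, while treating the chopped burst as the minimum transfer is a simplifying assumption.

```c
/* Minimal sketch (assumed DDR3-style numbers): how much of the minimum DRAM
 * transfer is useful for a given cache-block size.  A 64-bit bus moves
 * 8 bytes per beat; the chopped burst is 4 beats (32 bytes), so a block
 * smaller than 32 bytes still costs a 32-byte transfer. */
#include <stdio.h>

int main(void)
{
    const unsigned beat_bytes      = 8;                       /* 64-bit data bus   */
    const unsigned chop_beats      = 4;                       /* burst chop (BC4)  */
    const unsigned min_burst_bytes = beat_bytes * chop_beats; /* 32 bytes minimum  */

    for (unsigned block = 4; block <= 64; block *= 2) {
        unsigned moved  = block < min_burst_bytes ? min_burst_bytes : block;
        double   useful = 100.0 * block / moved;
        printf("%2u-byte block: %2u bytes moved, %3.0f%% useful\n",
               block, moved, useful);
    }
    return 0;
}
```

(This ignores the additional internal bandwidth penalty of burst chop that the paragraph mentions.)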

(Another, minor benefit may be facilitating way prediction and way memoization. This can be used to reduce latency or energy use. Even a simplistic predictor that associated the last hit way with the register used to generate the base address would have some success with unit stride accesses to arrays and data structures with multiple members accessed with significant temporal locality.)

While using larger cache blocks does cost extra bandwidth and storage for data that may never be used, and introduces false-sharing issues in multithreaded code, the benefits are significant.
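
To make the false-sharing point concrete, here is a minimal sketch of mine (assuming 64-byte cache lines; the struct names, alignment value, and iteration count are all illustrative). Both phases do the same work, but in the packed layout the two counters almost certainly share one cache line.

```c
/* Minimal sketch of false sharing (64-byte lines and all names are
 * assumptions).  Build with: cc -O2 -pthread <file>.c
 * The counters are volatile only so the optimizer cannot collapse each
 * loop into a single addition. */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

#define ITERS 100000000UL

struct packed {                           /* a and b share a cache line    */
    volatile unsigned long a, b;
};
struct padded {                           /* each counter gets its own line */
    alignas(64) volatile unsigned long a;
    alignas(64) volatile unsigned long b;
};

static struct packed p;
static struct padded q;

static void *bump_packed_b(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) p.b++;
    return NULL;
}

static void *bump_padded_b(void *arg)
{
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++) q.b++;
    return NULL;
}

int main(void)
{
    pthread_t t;

    /* Phase 1: both threads write fields that share one cache line. */
    pthread_create(&t, NULL, bump_packed_b, NULL);
    for (unsigned long i = 0; i < ITERS; i++) p.a++;
    pthread_join(t, NULL);

    /* Phase 2: same work, but the fields live on separate cache lines. */
    pthread_create(&t, NULL, bump_padded_b, NULL);
    for (unsigned long i = 0; i < ITERS; i++) q.a++;
    pthread_join(t, NULL);

    printf("packed: a=%lu b=%lu   padded: a=%lu b=%lu\n", p.a, p.b, q.a, q.b);
    return 0;
}
```

Timing the two phases separately on a multicore machine typically shows the packed phase running noticeably slower, because each write invalidates the other core's copy of the shared line; that is the false-sharing cost that larger blocks make more likely.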