
In the CS:APP textbook, the description of the memory mountain says that increasing the working set size worsens temporal locality, but I feel that both the size and stride factors contribute only to spatial locality, since throughput decreases when more data is sparsely stored in the lower-level caches.

Where is temporal locality in play here? As far as I know, it means the same specific memory address is referenced again in the near future as seen in this answer: What is locality of reference?

memory mountain chart

mzoz
  • quote from the textbook: `... Smaller values of size result in a smaller working set size, and thus better temporal locality. Smaller values of stride result in better spatial locality. ...` – mzoz Jun 23 '19 at 05:36
  • Your understanding of "temporal locality" is correct, I believe. Assuming completely random access to data, at a constant rate of access, note that a smaller working size means any given memory address will be accessed more frequently. This would translate directly into improved temporal locality. Conversely, as working size goes up, each memory address is accessed less frequently. Consider the degenerate example of looping repeatedly through the entire working data set. The smaller the set, the more frequently you'll complete a loop and repeat, thus accessing the same addresses as before. – Peter Duniho Jun 23 '19 at 05:58
  • Temporal locality depends on the nature of the data set traversal, so it exists in a sequential stride regardless of its size. However, since temporal locality must be exploited by a finite mechanism, it is easier to *extract* and *store* over smaller data sets. – Leeor Jun 23 '19 at 14:42
  • By the way, the definitions on your link were a little off, tried to elaborate there. – Leeor Jun 23 '19 at 15:09

1 Answer


This graph is produced by sequentially traversing fixed-size elements of an array. The stride parameter specifies the number of elements to be skipped between two sequentially accessed elements. The size parameter specifies the total size of the array (including the elements that may be skipped). The main loop of the test looks like this (you can get the code from here):

/* Sum every stride-th element of the array, using four independent
   accumulators (acc0..acc3 and data[] are declared elsewhere). */
for (i = 0; i < size / sizeof(double); i += stride * 4) {
    acc0 = acc0 + data[i];
    acc1 = acc1 + data[i + stride];
    acc2 = acc2 + data[i + stride * 2];
    acc3 = acc3 + data[i + stride * 3];
}

That loop is shown in the book in Figure 6.40. What is not shown or mentioned in the book is that this loop is executed once to warm up the cache hierarchy, and then memory throughput is measured over a number of runs. The minimum memory throughput of all the runs (on the warmed-up cache) is the one that is plotted.

Both the size and stride parameters together affect temporal locality (while only the stride affects spatial locality). For example, the 32k-s0 configuration has a temporal locality similar to that of the 64k-s1 configuration, because the first and last accesses to every line are separated by the same number of intervening cache lines. If you hold the size at a particular value and move along the stride axis, some lines that are repeatedly accessed at a lower stride are not accessed at all at higher strides, making their temporal locality essentially zero. (Temporal locality can be defined formally, but I won't do that here.) On the other hand, if you hold the stride at a particular value and move along the size axis, the temporal locality of each accessed line becomes smaller at larger sizes. However, performance deteriorates not because of the uniformly lower temporal locality of each accessed line, but because of the larger working set size.

I think the size axis illustrates the impact of the working set size (the amount of memory the loop accesses during its execution) on execution time better than it illustrates temporal locality. To observe the impact of temporal locality on performance, the memory throughput of the first run of this loop should be compared against that of the second run of the same loop (same size and stride). Temporal locality increases by the same amount for each accessed cache line in the second run, and, if the cache hierarchy is optimized for temporal locality, the throughput of the second run should be better than that of the first. In general, the throughput of each of N >= 2 sequential invocations of the same loop should be plotted to see the full impact of temporal locality.

By the way, memory mountains on other processors can be found here and here. You can create a 3D mountain plot using this or this script.

Hadi Brais
    You could argue that touching more cache lines before coming back to a previous cache line is lower temporal locality. So larger size = less temporal locality. Your definition seems to be something like the number of accesses to a cache line over the life of the program, so theoretical max hit rate = `(n-1) / n` (only mandatory misses, no capacity or conflict misses). It's a bit tricky to define usefully because what really matters is how many other cache lines you touch between coming back to an earlier one, not literally *time*. – Peter Cordes Jun 27 '19 at 23:56