
Say I am storing a linked list of objects, each a struct of size 64 bytes, which is also the size of my cache line. I'll be doing latency-sensitive adds, deletes, and iterations on the linked list over time. I understand that performance is dominated by whether the objects are kept in the cache, so that an access takes ~1 ns instead of the >50 ns of a trip to RAM.

It is generally recommended to accomplish this with spatial locality, ideally storing the objects in a contiguous block of memory. This is useful because whenever I access a memory address, the processor actually pulls in a cache line's worth of data; we want this additional data to be other useful objects so we put all our objects in a contiguous block.

I may be misunderstanding, but it seems that we get no benefit from this effect if the object size is >= the cache line size: only one object can ever be brought in per cache-line fill.
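
For concreteness, here is roughly the node layout I have in mind (a sketch only; the field names and the `alignas` padding are illustrative, not a fixed design):

```cpp
#include <cstdint>

// One node per 64-byte cache line (illustrative field names).
struct alignas(64) Node {
    Node*         next;        // 8 bytes on a 64-bit platform
    Node*         prev;        // 8 bytes
    std::uint64_t key;         // 8 bytes
    char          payload[40]; // pads the struct out to exactly 64 bytes
};
static_assert(sizeof(Node) == 64, "one node should occupy one cache line");
```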

rampatowl
  • Only one object can be brought into cache at a time, but with good prediction the CPU can happily be preloading the cache while performing other tasks or crunching the numbers it has already loaded. – user4581301 May 04 '18 at 17:34
  • Could you clarify what the implications of the second part of your sentence are with regard to spatial locality? Are you saying that if I access one object, the cache might predictively start loading other objects if they are close to the original object in memory? – rampatowl May 04 '18 at 17:38
  • When the data is all nicely lined up and the access pattern is predictable, the CPU knows exactly where to look for the next value and when it's going to need it, so it will try to load stuff before you need it (a small software-prefetch sketch follows these comments). Here's one of the classic Q/As on a related topic: [Why is it faster to process a sorted array than an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array) – user4581301 May 04 '18 at 18:17
  • @user4581301, you're right in saying that HW prefetching would be useful here (you can say that it extends spatial locality beyond a single cache line), but the link about the sorted array deals with a different issue (branch prediction). – Leeor May 05 '18 at 19:38
  • @Leeor Yeah. *Related* topic. Trying to bang home that predictability generally brings benefits in more ways than one. – user4581301 May 05 '18 at 20:52
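
To illustrate the pre-loading the comments mention, explicit software prefetching can also be layered on top of the hardware prefetcher; a minimal sketch, assuming GCC/Clang's `__builtin_prefetch` and a hypothetical `process()` per node:

```cpp
// Minimal singly linked node, just for this sketch.
struct Node {
    Node* next = nullptr;
    long  payload = 0;
};

long total = 0;

// Hypothetical per-node work.
void process(const Node* n) { total += n->payload; }

// While working on the current node, hint the CPU to start pulling the next
// one into cache. __builtin_prefetch is a GCC/Clang builtin and only a hint;
// the hardware prefetcher may already do this for predictable access patterns.
void traverse(const Node* head) {
    for (const Node* n = head; n != nullptr; n = n->next) {
        if (n->next) __builtin_prefetch(n->next);
        process(n);
    }
}
```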

1 Answer


Another factor to consider, beyond the benefit of pre-loading subsequent items when the object is smaller than a cache line, is associativity and mapping. With a linked list you have no contiguous layout (or at least no guarantee of one), so you are much more likely to hit collisions in the cache than if the data is laid out with spatial locality. On top of that, you do risk some level of memory fragmentation with the linked list model, although I'm not sure whether that's something you should even worry about.

Depending on the usage and access patterns for what you're doing, it's definitely worth weighing the relative benefits of algorithmic efficiency (deletes are very cheap in a linked list, expensive in an array or similar) against cache behaviour. If you're doing a lot of deleting/inserting, the benefits of algorithmic efficiency could far outweigh any benefits from cache locality.
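
To make that trade-off concrete, here is a rough sketch of the two delete costs (the types and names are illustrative, not from the question):

```cpp
#include <cstddef>
#include <vector>

// A minimal doubly linked node, just for illustration.
struct Node {
    Node* prev = nullptr;
    Node* next = nullptr;
    int   value = 0;
};

// Unlinking a known node from a doubly linked list is O(1): repoint the neighbours.
void unlink(Node* n) {
    if (n->prev) n->prev->next = n->next;
    if (n->next) n->next->prev = n->prev;
}

// Erasing from the middle of a contiguous array is O(n): every later element
// is shifted down, although each shift touches memory sequentially.
void erase_at(std::vector<int>& v, std::size_t i) {
    v.erase(v.begin() + static_cast<std::ptrdiff_t>(i));
}
```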

To clarify the associativity concept, it's worth taking a gander at how set-associative caches work. Basically, the associativity of the cache dictates how many locations in the cache a specific address can map to. Different cache levels will have different associativity, so for example in most cases the L1 cache is 2- or 4-way set associative, meaning any address can map to one of two (or four) locations in the cache. L2 and L3 are more likely to be 8- (or sometimes 12- or 16-) way associative.
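
As a rough illustration of that mapping, the set an address lands in is typically derived like this (the cache geometry below is a made-up example, not any particular CPU):

```cpp
#include <cstdint>

// Illustrative geometry: 32 KiB cache, 64-byte lines, 8-way set associative.
constexpr std::uint64_t kLineSize = 64;
constexpr std::uint64_t kWays     = 8;
constexpr std::uint64_t kSets     = (32 * 1024) / (kLineSize * kWays); // 64 sets

// Every address maps to exactly one set; within that set it can occupy any of
// the kWays slots, so two lines only conflict when they map to the same set
// and all of that set's slots are already in use.
std::uint64_t set_index(std::uint64_t address) {
    return (address / kLineSize) % kSets;
}
```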

As far as the specific associativity of Intel/AMD/etc. CPUs goes, that's a tougher call, since even Intel folks have a hard time coming up with a good answer! One example I found was the Xeon X5660, whose L1 instruction cache is 4-way set associative, L1 data cache 8-way, L2 8-way, and L3 16-way.
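
If you want to see the figures for your own machine, on Linux the kernel exposes the cache topology through sysfs; a small sketch that reads it (the paths follow the standard sysfs cache layout, but not every system populates them):

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Print level, type and associativity for each cache the kernel reports for cpu0.
int main() {
    for (int i = 0; ; ++i) {
        const std::string dir =
            "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(i) + "/";
        std::ifstream level(dir + "level");
        std::ifstream type(dir + "type");
        std::ifstream ways(dir + "ways_of_associativity");
        if (!level || !type || !ways) break;  // no more cache indices
        std::string l, t, w;
        std::getline(level, l);
        std::getline(type, t);
        std::getline(ways, w);
        std::cout << "L" << l << ' ' << t << ": " << w << "-way\n";
    }
}
```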

The algorithms used by modern CPUs for cache usage, etc, are pretty darned amazing, and go beyond just the basics outlined here, so I think in practice you'd find very little impact from this.

Varrak
  • Thanks for the reply Steve! I hadn't considered cache associativity. Are cache mappings generally implemented such that nearby memory addresses are mapped to distinct cache locations even in a k-way associative cache? To clarify my design question some: I'm set on using a linked list for the fast deletes you mentioned. I'm also planning on using an object pool to minimize runtime allocations. Within the object pool, is there any advantage to having neighboring linked list objects in adjacent memory locations, or is the order irrelevant since object size > cache line? – rampatowl May 04 '18 at 17:23
  • Ah, that makes sense. The object pool idea will be a huge win (cutting down on the number of allocations is always a win, since they can be costly), and allocating one nice big pool up front will give you some level of memory locality that'll help, too (a bare-bones pool sketch follows these comments). Given how good modern CPUs are at cache utilization, I'm pretty sure the answer is that ordering is pretty much irrelevant for your usage pattern, and that the algorithm and optimizations like the object pool are significantly more important. – Varrak May 04 '18 at 18:15
  • Gotcha. Thanks for the intuition, I appreciate the help! – rampatowl May 04 '18 at 18:20
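
For what it's worth, a bare-bones version of the pool discussed in these comments might look like the following (a sketch only, assuming a fixed capacity known up front; the names and payload size are illustrative):

```cpp
#include <cstddef>
#include <vector>

// One pooled node per 64-byte cache line (payload size is illustrative).
struct alignas(64) PoolNode {
    PoolNode* next = nullptr;   // list link, reused as the free-list link
    PoolNode* prev = nullptr;
    char      payload[48]{};    // pads the node to exactly 64 bytes
};

// Fixed-capacity pool: one up-front allocation, free slots threaded on an
// intrusive free list, so acquire/release never touch the general allocator.
class NodePool {
public:
    explicit NodePool(std::size_t capacity) : storage_(capacity) {
        // Thread every slot onto the free list, in address order.
        for (std::size_t i = 0; i + 1 < capacity; ++i)
            storage_[i].next = &storage_[i + 1];
        free_ = capacity ? &storage_[0] : nullptr;
    }

    PoolNode* acquire() {                  // O(1), no heap allocation
        PoolNode* n = free_;
        if (n) { free_ = n->next; n->next = n->prev = nullptr; }
        return n;                          // nullptr once the pool is exhausted
    }

    void release(PoolNode* n) {            // O(1), no heap deallocation
        n->next = free_;
        free_ = n;
    }

private:
    std::vector<PoolNode> storage_;        // one contiguous block
    PoolNode*             free_ = nullptr;
};
```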