11

For ultra-fast code it is essential that we maintain locality of reference: keep as much of the data that is used closely together in the CPU cache as possible:

http://en.wikipedia.org/wiki/Locality_of_reference

What techniques are there to achieve this? Could people give examples?

I am interested in Java and C/C++ examples. It would be interesting to know of ways people use to avoid lots of cache swapping.

Greetings

intrigued_66
  • 16,082
  • 51
  • 118
  • 189
  • See this language independent question [How does one write code that best utilizes the CPU cache to improve performance](http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance) – Bo Persson Mar 22 '12 at 11:59
  • You can approach it from two sides: Shuffling the data in memory is one approach, shuffling the processing in time is another. – MSalters Mar 22 '12 at 12:02
  • @MSalters, but putting 0.5MB of data into RAM doesn't guarantee it is all in the cache at the same time? – intrigued_66 Mar 22 '12 at 12:11
  • @user1107474: Well there could be other processes, on other CPU cores, that are competing for the cache. But if the CPU cache is large enough for you and the other processes, then those 512 kB *will* be in cache. To put them in RAM, the CPU has to go through the cache, and since we assumed a large-enough cache the cache won't discard the data. – MSalters Mar 22 '12 at 12:14
  • @user1107474: When people talk about locality, the main concern is helping the processor preload the cache as the process is running. That is, the problem to solve is what happens when the memory is actually not in the cache (some other process has been running, it does not fit in your L1, L2 or L3 cache...). If the amount of data is very small, you can *assume* that it will be in the cache and expect not to have *many* problems (note: *assume* and *many* are no guarantees). – David Rodríguez - dribeas Mar 22 '12 at 12:21
  • @David, so for VERY fast financial software they re-write the Linux OS because every single OS process occurring in the background could potentially kick out the valuable algorithm data from the cache and replace it with.... ? – intrigued_66 Mar 23 '12 at 12:25
  • @user1107474: I don't know how you came to that conclusion... I have never worked on trading, but in general, if you have an appliance where you **need** performance, you tweak the OS. In my previous project the kernel scheduler was tweaked, and the box had very limited things installed and running. The core processes were bound to a particular processor with a high priority, leaving some specific cores up for grabs for other processes. – David Rodríguez - dribeas Mar 23 '12 at 13:26

4 Answers

11

This is probably too generic to have a clear answer. The approaches in C or C++ compared to Java will differ quite a bit (the way each language lays out objects differs).

The basic idea is to keep data that will be accessed in tight loops together. If your loop operates on type T, which has members m1...mN, but only m1...m4 are used in the critical path, consider breaking T into a T1 that contains m1...m4 and a T2 that contains m5...mN. You might want to add to T1 a pointer that refers to T2. Try to avoid objects that are unaligned with respect to cache boundaries (very platform dependent).
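
A minimal C++ sketch of such a hot/cold split (the Particle names and members here are hypothetical, standing in for T, T1 and T2):

```cpp
#include <cstdint>
#include <vector>

// Cold data: rarely touched, kept out of the hot loop's cache footprint.
struct ParticleCold {
    char     name[32];
    uint64_t creationTime;
    // ... other rarely used members
};

// Hot data: only what the critical loop reads and writes.
// Several of these fit in a single 64-byte cache line.
struct ParticleHot {
    float x, y, z;        // position, updated every iteration
    float vx, vy, vz;     // velocity, updated every iteration
    ParticleCold* cold;   // pointer over to the rarely used members
};

void integrate(std::vector<ParticleHot>& particles, float dt) {
    for (ParticleHot& p : particles) {  // touches only hot data
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
    }
}
```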

Use contiguous containers (a plain old array in C, vector in C++) and try to iterate up or down, rather than jumping randomly all over the container. Linked lists are killers for locality: two consecutive nodes in a list might be at completely unrelated memory locations.
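
To illustrate (a sketch, not a benchmark): both functions below compute the same sum, but the vector version walks contiguous memory that the hardware prefetcher can stream in, while the list version may chase pointers to scattered allocations:

```cpp
#include <list>
#include <numeric>
#include <vector>

long long sumVector(const std::vector<int>& v) {
    // Elements are contiguous; sequential access is prefetcher-friendly.
    return std::accumulate(v.begin(), v.end(), 0LL);
}

long long sumList(const std::list<int>& l) {
    // Each node is a separate allocation; consecutive nodes may live
    // anywhere in memory, so every step can be a cache miss.
    return std::accumulate(l.begin(), l.end(), 0LL);
}
```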

Object containers (and generics) in Java are also a killer: while in a Vector the references are contiguous, the actual objects are not (there is an extra level of indirection). In Java there is also a lot of extra overhead: if you new two objects one right after the other, they will probably end up in almost contiguous memory locations, but there will be some object-management data (usually two or three pointers' worth) in between. The GC will move objects around, but hopefully won't make things much worse than they were before it ran.

If you are focusing on Java, create compact data structures. If you have an object that has a position, and it is to be accessed in a tight loop, consider holding x and y as primitive types inside your object rather than creating a Point and holding a reference to it. Reference types need to be newed, and that means a separate allocation, an extra indirection and less locality.
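
The same principle carries over to C++; as a hedged sketch (the Entity/Point names are hypothetical), compare storing the point through a pointer versus embedding it by value:

```cpp
struct Point { double x, y; };

// Poor locality: the Point lives in a separate allocation,
// so reading the position costs an extra indirection.
struct EntityIndirect {
    Point* position;   // allocated elsewhere with new
};

// Good locality: x and y are embedded directly in the object,
// loaded together with the rest of the entity.
struct EntityCompact {
    Point position;    // stored by value, no extra allocation
};
```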

w_n
  • 345
  • 3
  • 7
David Rodríguez - dribeas
  • 204,818
  • 23
  • 294
  • 489
  • A good guide for optimization in C/C++ which treats all these issues is "Optimizing software in C++"; a full section covers optimization of locality of reference: www.agner.org/optimize/optimizing_cpp.pdf – linello Mar 22 '12 at 13:18
5

Two common techniques include:

  • Minimalism (of data size and/or code size/paths)
  • Use cache oblivious techniques

Example for minimalism: In ray tracing (a 3D graphics rendering paradigm), it is a common approach to use Kd-trees with 8-byte nodes to store static scene data. The traversal algorithm fits in just a few lines of code. Then, the Kd-tree is often built in a manner that minimizes the number of traversal steps by having large, empty nodes at the top of the tree (the surface area heuristic, after Havran).

Branch mispredictions during traversal typically have a probability of 50%, but they cost little, because a great many nodes fit in cache (consider that you get 128 of these 8-byte nodes per KiB!), and one of the two child nodes is always a direct neighbour in memory.
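
As a sketch of what such an 8-byte node can look like in C++ (this particular bit layout is an assumption, modeled on common compact Kd-tree encodings, not taken from the answer): 2 bits select the split axis or mark a leaf, and the remaining bits index the other child or the primitive list, while the first child sits directly after its parent in the node array.

```cpp
#include <cstdint>

// One Kd-tree node in 8 bytes: 128 of these fit in a KiB,
// and 8 fit in a typical 64-byte cache line.
struct KdNode {
    float    split;     // split-plane position (unused in leaves)
    uint32_t flags;     // low 2 bits: axis 0/1/2, or 3 = leaf
                        // high 30 bits: index of the far child
                        // (or of the primitive list, for leaves)

    bool     isLeaf() const { return (flags & 3u) == 3u; }
    unsigned axis()   const { return flags & 3u; }
    uint32_t index()  const { return flags >> 2; }
};

static_assert(sizeof(KdNode) == 8, "node must stay 8 bytes");
```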

Example for cache oblivious techniques: Morton array indexing, also known as Z-order-curve indexing. This kind of indexing might be preferred if you usually access nearby array elements in unpredictable directions. It can be valuable for large image or voxel data where you might have pixels of 32 or even 64 bytes, millions of them (a typical compact camera's measure is megapixels, right?), or even thousands of billions for scientific simulations.
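
For illustration, a small C++ sketch of 2-D Morton indexing, interleaving the bits of x and y so that elements that are neighbours in either direction also land near each other in memory:

```cpp
#include <cstdint>

// Spread the lower 16 bits of v so there is a zero bit between each bit.
static uint32_t part1By1(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order index: bits of x and y interleaved (x in the even positions).
uint32_t mortonIndex(uint32_t x, uint32_t y) {
    return part1By1(x) | (part1By1(y) << 1);
}

// Usage: pixel (x, y) is stored at image[mortonIndex(x, y)] instead of
// image[y * width + x], which keeps 2-D neighbours close in memory.
```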

However, both techniques have one thing in common: keep the most frequently accessed stuff nearby; less frequently used things can be further away, spanning the whole range from L1 cache over main memory to hard disk, then other computers in the same room, the next room, the same country, worldwide, other planets.

Sebastian Mach
  • 38,570
  • 8
  • 95
  • 130
1

Some random tricks that come to my mind, some of which I used recently:

Rethink your algorithm. For example, you have an image with a shape, and a processing algorithm that looks for corners of the shape. Instead of operating on the image data directly, you can preprocess it, save all the shape's pixel coordinates in a list, and then operate on the list. You avoid jumping randomly around the image.
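
A minimal sketch of that preprocessing pass in C++ (the 8-bit image layout and the non-zero-means-shape test are assumptions):

```cpp
#include <cstdint>
#include <vector>

struct Pixel { int x, y; };

// Pass 1: scan the image once, sequentially, and collect the
// coordinates of all shape pixels into a compact, contiguous list.
std::vector<Pixel> collectShapePixels(const std::vector<uint8_t>& image,
                                      int width, int height) {
    std::vector<Pixel> shape;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (image[y * width + x] != 0)   // assumed: non-zero = shape
                shape.push_back({x, y});
    return shape;
}

// Pass 2 (e.g. corner detection) then iterates over the small,
// contiguous 'shape' vector instead of jumping around the image.
```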

Shrink data types. A regular int takes 4 bytes; if you manage to use e.g. uint16_t instead, you can cache 2x more stuff.
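
For example (a trivial C++ sketch):

```cpp
#include <cstdint>
#include <vector>

std::vector<int32_t>  a(1000);  // 4000 bytes: ~63 cache lines at 64 B each
std::vector<uint16_t> b(1000);  // 2000 bytes: ~32 cache lines, twice the
                                // elements per line, if the values fit
```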

Sometimes you can use bitmaps. I used one for processing a binary image: I stored one pixel per bit, so I could fit 8*32 pixels in a single cache line. It really boosted the performance.
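
A hedged C++ sketch of that bit-per-pixel packing (the 32-bit word layout is an assumption; the original code may have differed):

```cpp
#include <cstdint>
#include <vector>

// One bit per pixel: 8*32 = 256 pixels fit in 32 bytes,
// 8x more than with one byte per pixel.
struct BitImage {
    int width, height;
    std::vector<uint32_t> words;  // ceil(width*height / 32) words

    BitImage(int w, int h)
        : width(w), height(h), words((std::size_t(w) * h + 31) / 32) {}

    bool get(int x, int y) const {
        std::size_t i = std::size_t(y) * width + x;
        return (words[i / 32] >> (i % 32)) & 1u;
    }
    void set(int x, int y) {
        std::size_t i = std::size_t(y) * width + x;
        words[i / 32] |= 1u << (i % 32);
    }
};
```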

From Java, you can use JNI (it's not difficult) and implement your critical code in C to control the memory layout.

Jakub M.
  • 32,471
  • 48
  • 110
  • 179
0

In the Java world the JIT is going to be working hard to achieve this, and trying to second-guess it is likely to be counterproductive. This SO question addresses Java-specific issues more fully.

Community
  • 1
  • 1
Brian Agnew
  • 268,207
  • 37
  • 334
  • 440