Tweaking java classes for CPU cache friendliness

Question

When designing java classes, what are the recommendations for achieving CPU cache friendliness?

What I have learned so far is that one should use POD as much as possible (i.e. int instead of integer). Thus, the data will be allocated consecutively when allocating the containing object. E.g.

class Local
{
    private int data0;
    private int data1;
    // ...
};

is more cache friendly than

class NoSoLocal
{
    private Integer data0;
    private Integer data1;
    //...
};

The latter will require two separate allocations for the Integer objects that can be at arbitrary locations in memory, esp. after a GC run. OTOH the first approach might lead to duplication of data in cases where the data can be reused.

Is there a way to have them located close to each other in memory so that the parent object and its' containing elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory plus the GC will keep them together?

Do you want to keep arbitrary reference type instances "close to each other in memory" or is this specifically about boxed POD? — , Jun 23 '14 at 13:14
Its more about keeping related object, i.e. all the elements (not only their references) close to each other in memory. — ogni42, Jun 23 '14 at 13:29
If you're really worried about this, Java is not the language for you — Amir Afghani, Jun 23 '14 at 21:49
So I did some benchmark tests: In a single threaded scenario the difference is between 1.2 and 1.9 in favor of local data. In a multithreaded scenario, the advantage of local data is between 9-20 times. So it is surely worth spending some effort on locality of data for server side code... — ogni42, Jul 02 '14 at 13:28
tested the same code on my i7 4770 (the other tests were on my i5 laptop): Here the local data is processed 35 times faster in a MT scenarion than data using references. — ogni42, Jul 07 '14 at 07:49

score 7 · Answer 1 · edited May 23 '17 at 12:33

You cannot force JVM to place related objects close to each other (though JVM tries to do it automatically). But there are certain tricks to make Java programs more cache-friendly.

Let me show you some examples from the real-life projects.

BEWARE! This is not a recommended way to code in Java!
Do not adopt the following techniques unless you are absolutely sure why you are doing it.

Inheritance over composition. You've definitely heard the contrary principle "Favor composition over inheritance". But with composition you have an extra reference to follow. This is not good for cache locality and also requires more memory. The classic example of inheritance over composition is JDK 8 Adder and Accumulator classes which extend utility Striped64 class.
Transform arrays of structures into a structure of arrays. This again helps to save memory and to speed-up bulk operations on a single field, e.g. key lookups:
```
class Entry {
    long key;
    Object value;
}

Entry[] entries;
```
will be replaced with
```
long[] keys;
Object[] values;
```

Flatten data structures by inlining. My favorite example is inlining 160-bit SHA1 hash represented by byte[]. The code before:

class Blob {
    long offset;
    int length;
    byte[] sha1_hash;
}

The code after:

class Blob {
    long offset;
    int length;
    int hash0, hash1, hash2, hash3, hash4;
}

Replace String with char[]. You know, String in Java contains char[] object under the hood. Why pay performance penalty for an extra reference?
Avoid linked lists. Linked lists are very cache-unfriendly. Hardware works best with linear structures. LinkedList can be often replaced with ArrayList. A classic HashMap may be replaced with an open address hash table.
Use primitive collections. Trove is a high-performance library containg specialized lists, maps, sets etc. for primitive types.
Build your own data layouts on top of arrays or ByteBuffers. A byte array is a perfect linear structure. To achieve the best cache locality you can pack an object data manually into a single byte array.

score 1 · Accepted Answer · answered Jun 23 '14 at 13:36

the first approach might lead to duplication of data in cases where the data can be reused.

But not in the case you mention. An int is 4 bytes, a reference is typically 4-bytes so you don't gain anything by using Integer. For a more complex type, it can make a big difference however.

Is there a way to have them located close to each other in memory so that the parent object and its' containing elements will be in the CPU cache at once and not distributed arbitrarily over the whole memory plus the GC will keep them together?

The GC will do this anyway, provided the objects are only used in one place. If the objects are used in multiple places, they will be close to one reference.

Note: this is not guaranteed to be the case, however when allocating objects they will typically be continuous in memory as this is the simplest allocation strategy. When copying retained objects, the HotSpot GC will copy them in reverse order of discovery. i.e. they are still together but in reverse order.

Note 2: Using 4 bytes for an int is still going to be more efficient than using 28 bytes for an Integer (4 bytes for reference, 16 bytes for object header, 4 bytes for value and 4 bytes for padding)

Note 3: Above all, you should favour clarity over performance, unless you have measured you need and have a more performant solution. In this case, an int cannot be a null but an integer can be null. If you want a value which should not be null use int, not for performance but for clarity.

`But not in the case you mention.` I know, this was just meant as example. In my current case it is a more complex object structure where the elements will be filled case by case. The important point is: Will it be guaranteed that objects which are owned by only one other object will alway be located close to the owning object — ogni42, Jun 23 '14 at 13:48
@ogni42 It is not guaranteed, and is impossible if the object is used more than once. However, if the object is referenced once it is *likely* to be together if they are created by the program or copied by the GC at the same time. — Peter Lawrey, Jun 23 '14 at 13:51

Tweaking java classes for CPU cache friendliness

2 Answers2

Linked