60

When writing a Java program, do I have influence on how the CPU will utilize its cache to store my data? For example, if I have an array that is accessed a lot, does it help if it's small enough to fit in one cache line (typically 128 bytes on a 64-bit machine)? What if I keep a much-used object within that limit? Can I expect the memory used by its members to be close together and to stay in cache?

Background: I'm building a compressed digital tree that's heavily inspired by Judy arrays, which are written in C. While I'm mostly after its node compression techniques, Judy has CPU cache optimization as a central design goal, and the node types as well as the heuristics for switching between them are heavily influenced by that. I was wondering whether I have any chance of getting those benefits, too.

Edit: The general advice of the answers so far is: don't try to micro-optimize machine-level details when you're as far away from the machine as you are in Java. I totally agree, so I felt I had to add some (hopefully) clarifying comments to better explain why I think the question still makes sense. These are below:

There are some things that are just generally easier for computers to handle because of the way they are built. I have seen Java code run noticeably faster on compressed data (from memory), even though the decompression took additional CPU cycles. If the data were stored on disk, it's obvious why that is so, but of course the same principle applies in RAM.

Now, computer science has lots to say about what those things are; for example, locality of reference is great in C, and I guess it's still great in Java, maybe even more so if it helps the optimizing runtime do more clever things. But how you accomplish it might be very different. In C, I might write code that manages larger chunks of memory itself and uses adjacent pointers for related data.

In Java, I can't (and don't want to) know much about how memory is going to be managed by a particular runtime. So I have to take optimizations to a higher level of abstraction, too. My question is basically, how do I do that? For locality of reference, what does "close together" mean at the level of abstraction I'm working on in Java? Same object? Same type? Same array?

In general, I don't think that abstraction layers change the "laws of physics", metaphorically speaking. Doubling your array in size every time you run out of space is a good strategy in Java, too, even though you don't call malloc() anymore.
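To make that concrete, here's roughly what I mean, sketched over a primitive array (the class name and sizes are just for illustration):

import java.util.Arrays;

// A tiny append-only int buffer that doubles its backing array when full,
// keeping all elements in one contiguous primitive array.
final class GrowableIntArray {
    private int[] data = new int[16];
    private int size = 0;

    void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // amortized O(1) per append
        }
        data[size++] = value;
    }

    int get(int index) { return data[index]; }

    int size() { return size; }
}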

Hanno Fietz
  • This is a valid question. Right now I'm seeing a variation of >3 in runtime for multiplying two 2000x2000 matrices of ints using the naive representation and n^3 algorithm. I assume it is an effect of caching due to different relative positions of the arrays. – starblue Oct 03 '09 at 16:40
  • Thanks, starblue. I'm pretty sure that there are things that you can do, but compared to C, there might be more things to consider in Java, such as not obstructing optimizations that the runtime could do. – Hanno Fietz Oct 05 '09 at 07:54
  • RE your edit: Sure, you still want to optimize your code even though you're farther from the bare metal. To take a trivial example, you should surely use a boolean rather than a String containing "true" or "false" because this is more efficient -- as well as for others reasons like maintainability and reliability, but performance is a valid entry on the list. The question is not whether you should try to make your code efficient, but how you go about it and recognizing the trade-offs. – Jay Aug 26 '11 at 20:29
  • JVMs are different - learn the characteristics of yours, including all the JIT settings, which control what you need here, and be prepared to do it again within a few years. – Thorbjørn Ravn Andersen Jul 01 '12 at 06:48

5 Answers

23

The key to good performance with Java is to write idiomatic code, rather than trying to outwit the JIT compiler. If you write your code to try to influence it to do things in a certain way at the native instruction level, you are more likely to shoot yourself in the foot.

That isn't to say that common principles like locality of reference don't matter. They do, but I would consider the use of arrays and such to be performance-aware, idiomatic code rather than "tricky."

HotSpot and other optimizing runtimes are extremely clever about how they optimize code for specific processors. (For an example, check out this discussion.) If I were an expert machine language programmer, I'd write machine language, not Java. And if I'm not, it would be unwise to think that I could do a better job of optimizing my code than the experts.

Also, even if you do know the best way to implement something for a particular CPU, the beauty of Java is write-once-run-anywhere. Clever tricks to "optimize" Java code tend to make optimization opportunities harder for the JIT to recognize. Straightforward code that adheres to common idioms is easier for an optimizer to recognize. So even when you get the best Java code for your testbed, that code might perform horribly on a different architecture, or at best fail to take advantage of enhancements in future JITs.

If you want good performance, keep it simple. Teams of really smart people are working to make it fast.

erickson
  • Still, there are certain things in code that are just generally easier for computers, regardless of how many layers of abstraction are on top of each other. I have seen Java code run noticeably faster on data that was compressed in memory, although the decompression had to take additional CPU cycles. This is just because transferring data from RAM to CPU is a bottleneck that no JIT is going to overcome if you insist on keeping your data large. Also, if I put data in a small array, I doubt that any JVM will start scattering it all over the RAM, so I can improve locality of reference by doing it. – Hanno Fietz Sep 26 '09 at 13:13
  • That's a very dangerous line of thinking. "I doubt any JVM..." It's completely within the rights of any implementer of an abstraction to change its implementation as long as they don't change the semantics of the interface; this is why you really should try to work at a level appropriate to the abstraction. – Falaina Sep 26 '09 at 14:01
  • Yes, Hanno, I think you have some good points. But they are points that are just broad-strokes good computer science: improving locality of reference, minimizing I/O, etc. You aren't getting down to the level of "most L1 data caches are sized like this, so..." – erickson Sep 27 '09 at 04:23
  • I don't think that just saying "trust the JIT" is especially useful advice when talking about memory locality. The JIT is not going to move fields from one class to another to improve locality, nor will it tell the GC to keep objects close together in memory. I have observed situations where a program slows down significantly after the first GC because tenured objects are copied in a sub-optimal order. The vast majority of Java programs aren't like this, of course, but when you find yourself in this situation, "idiomatic coding" is not going to help you. – Christopher Barber Dec 01 '15 at 17:46
22

If the data you're crunching is primarily or wholly made up of primitives (e.g. in numeric problems), I would advise the following.

Allocate a flat structure of fixed-size arrays-of-primitives at initialisation time, and make sure the data therein is periodically compacted/defragmented (0->n where n is the smallest max index possible given your element count), to be iterated over using a for-loop. This is the only way to guarantee contiguous allocation in Java, and compaction further serves to improve locality of reference. Compaction is beneficial, as it reduces the need to iterate over unused elements, reducing the number of conditionals: as the for loop iterates, termination occurs earlier, and less iteration = less movement through the heap = fewer chances for a cache miss. While compaction creates an overhead in and of itself, it may be done only periodically (with respect to your primary areas of processing) if you so choose.
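A rough sketch of the compaction step might look like this (just one possible scheme, assuming a parallel boolean array marks which slots are live):

// Compact live elements to the front of the array so the hot loop only has
// to iterate over indices 0..liveCount-1.
static int compact(double[] values, boolean[] live) {
    int writeIdx = 0;
    for (int readIdx = 0; readIdx < values.length; readIdx++) {
        if (live[readIdx]) {
            values[writeIdx] = values[readIdx];
            live[writeIdx] = true;
            writeIdx++;
        }
    }
    // clear the tail so stale slots are not mistaken for live ones
    for (int i = writeIdx; i < live.length; i++) {
        live[i] = false;
    }
    return writeIdx; // new element count n; iterate 0..n-1 afterwards
}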

Even better, you can interleave values in these pre-allocated arrays. For instance, if you are representing spatial transforms for many thousands of entities in 2D space, and are processing the equations of motion for each of them, you might have a tight loop like

int axIdx, ayIdx, vxIdx, vyIdx, xIdx, yIdx;
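
// Assumed setup (not shown in the original snippet): 'array' is the pre-allocated
// flat primitive array, e.g. float[] array = new float[6 * entityCount];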

//Acceleration, velocity, and displacement for each
//of x and y totals 6 elements per entity.
for (axIdx = 0; axIdx < array.length; axIdx += 6) 
{
    ayIdx = axIdx+1;
    vxIdx = axIdx+2;
    vyIdx = axIdx+3;
    xIdx = axIdx+4;
    yIdx = axIdx+5;

    //velocity1 = velocity0 + acceleration 
    array[vxIdx] += array[axIdx];
    array[vyIdx] += array[ayIdx];

    //displacement1 = displacement0 + velocity
    array[xIdx] += array[vxIdx];
    array[yIdx] += array[vyIdx];
}

This example ignores such issues as rendering of those entities using their associated (x,y)... rendering always requires non-primitives (thus, references/pointers). If you do need such object instances, then you can no longer guarantee locality of reference, and you will likely be jumping around all over the heap. So if you can split your code into sections where you have primitive-intensive processing as shown above, then this approach will help you a lot. For games at least, AI, dynamic terrain, and physics can be some of the most processor-intensive aspects, and are all numeric, so this approach can be very beneficial.

Engineer
  • This is a really good answer. One of the worst things about Java is the lack of directly accessible user types. The issue was probably not that important in 1990 when the language was designed, but it is today. So the only *true* answer to the question is "Do not care, or choose a different language, possibly C# (has real structs), or one of the big _native_ languages: C, C++, Fortran..." – user877329 Apr 29 '16 at 07:11
  • @user877329 Yes. These days, primarily for reasons of more direct memory access, I use pure C where I can. – Engineer Apr 29 '16 at 07:14
  • That is the real answer. – Mohamed El-Nakeep Dec 11 '17 at 04:05
  • For my own reference as well as others', see also [this answer](https://stackoverflow.com/questions/9632288/can-i-allocate-objects-contiguously-in-java). – Engineer Nov 29 '18 at 15:40
  • The Saxon XSLT processor implements a tree model of XML using arrays of int primitives; this was originally done for memory reasons but I think it also delivers excellent locality of reference when doing searching and navigation. – Michael Kay Apr 03 '20 at 09:31
5

If you are down to where an improvement of a few percent makes a difference, use C where you'll get an improvement of 50-100%!

If you think that the ease of use of Java makes it a better language to use, then don't screw it up with questionable optimizations.

The good news is that Java will do a lot of stuff beneath the covers to improve your code at runtime, but it almost certainly won't do the kind of optimizations you're talking about.

If you decide to go with Java, just write your code as clearly as you can; don't take minor optimizations into account at all. (Major ones, like using the right collections for the right job and not allocating/freeing objects inside a loop, are still worthwhile.)
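For example, something like this is the kind of "major" choice I mean (names are made up; the point is pre-sizing the collection and not allocating a fresh buffer on every iteration):

import java.util.ArrayList;
import java.util.List;

static List<String> formatAll(int[] ids) {
    List<String> out = new ArrayList<>(ids.length); // right collection, pre-sized
    StringBuilder sb = new StringBuilder();         // allocated once, outside the loop
    for (int id : ids) {
        sb.setLength(0);                            // reuse the buffer each iteration
        sb.append("id-").append(id);
        out.add(sb.toString());
    }
    return out;
}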

Bill K
  • Ditto. The moral of the story is NOT "don't try to optimize". It's "optimize within the structure of the language". When you're thinking about how to trick the compiler into generating better machine code, you're at best wasting your time. I strongly suspect that the people who wrote the JVM would have difficulty figuring out exactly how to construct the sort of trick the questioner is talking about, never mind the majority of us who have only the vaguest idea of what's really going on inside the JVM. – Jay Sep 25 '09 at 20:59
  • Although I agree with you @Jay, I usually use different semantics. Optimizing is trying to find a way to pervert clean code in an attempt to make it faster, but choosing the correct collection, avoiding inner loops where possible, avoiding allocations in loops where possible, those are just coding correctly. (Hmm, avoiding allocating objects does fall into the category of "optimizations that corrupt your code" Hmph!) – Bill K Sep 26 '09 at 03:35
  • @Bill - in Java, you shouldn't even be too afraid to allocate short-lived objects. The garbage collector is very fast and efficient at cleaning up short-lived objects. (I.e. don't corrupt your code to save on allocation some short-lived objects too soon). – Jesper Sep 26 '09 at 06:54
  • @Jesper I agree and felt bad about saying that, but he was talking about shaving minute amounts off a tight loop and being aware of all the things you do in a tight loop is somewhat important to just being a programmer, so I left it--but I think you saw me waffling about it in my previous comment :) – Bill K Sep 26 '09 at 23:44
5

So far the advice is pretty strong: in general it's best not to try to outsmart the JIT. But as you say, some knowledge of the details is sometimes useful.

Regarding memory layout for objects, Sun's JVM (now Oracle's) lays out an object's fields in memory grouped by type (i.e. doubles and longs first, then ints and floats, then shorts and chars, after that bytes and booleans, and finally object references). You can get more details here.
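If you want to see the layout your particular JVM actually uses, the JOL (Java Object Layout) tool can print it, something like this (assuming the org.openjdk.jol:jol-core dependency is on the classpath; the Example class is made up):

import org.openjdk.jol.info.ClassLayout;

public class LayoutDemo {
    // Made-up class with mixed field types, to see how the JVM regroups them.
    static class Example {
        boolean flag;
        long counter;
        int id;
        Object ref;
    }

    public static void main(String[] args) {
        // Prints the offset and size of each field as laid out by the running JVM.
        System.out.println(ClassLayout.parseClass(Example.class).toPrintable());
    }
}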

Local variables are usually kept on the stack (that is, references and primitive values).

As Nick mentions, the best way to ensure a particular memory layout in Java is by using primitive arrays. That way you can make sure that data is contiguous in memory. Be careful about array sizes, though; GCs have trouble with large arrays. It also has the downside that you have to do some memory management yourself.

On the upside, you can use a Flyweight pattern to get Object-like usability while keeping fast performance.
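For example, a flyweight can be a single reusable "cursor" over the flat array (the names and the six-values-per-entity layout are just illustrative, borrowed from the answer above):

// One reusable "view" over the flat primitive array: move it to an entity's
// slot and read/write fields through it, instead of allocating per-entity objects.
final class EntityCursor {
    private static final int STRIDE = 6; // ax, ay, vx, vy, x, y
    private final float[] data;
    private int base;

    EntityCursor(float[] data) { this.data = data; }

    EntityCursor moveTo(int entityIndex) {
        base = entityIndex * STRIDE;
        return this;
    }

    float x()        { return data[base + 4]; }
    void  x(float v) { data[base + 4] = v; }
    float y()        { return data[base + 5]; }
    void  y(float v) { data[base + 5] = v; }
}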

If you need the extra oomph in performance, generating your own bytecode on the fly helps with some problems, as long as the generated code is executed enough times and your VM's native code cache doesn't get full (which disables the JIT for all practical purposes).

juancn
3

To the best of my knowledge: No. You pretty much have to be writing in machine code to get that level of optimization. With assembly you're a step away because you no longer control where things are stored. With a compiler you're two steps away because you don't even control the details of the generated code. With Java you're three steps away because there's a JVM interpreting your code on the fly.

I don't know of any constructs in Java that let you control things on that level of detail. In theory you could indirectly influence it by how you organize your program and data, but you're so far away that I don't see how you could do it reliably, or even know whether or not it was happening.

Jay
  • I disagree; in practice, the VMs are few, and massive changes in internal layout are rare. So if you know your runtime architecture, you can optimize for cache. I once had code to compute a correlation matrix. The algorithm was essentially O(N^2); I changed it to be mindful of L1/L2 cache usage, and got a 5x speedup in Java. This used primitive arrays mostly. It's hard to do the same with plain Java objects, but it can be done, for example by reducing your hot working set so it fits in cache. Even the choice of ArrayList vs LinkedList helps. – juancn Jan 21 '16 at 19:32
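For reference, a generic cache-blocking (loop tiling) sketch of the kind of restructuring described in that comment; this is not the original correlation code, and the block size is just a tunable guess:

// Blocked multiply of two n x n matrices stored row-major in flat double[] arrays.
// Working on BLOCK x BLOCK tiles keeps the data touched by the inner loops small
// enough to stay in L1/L2 cache. Assumes 'c' starts zero-filled.
static void multiplyBlocked(double[] a, double[] b, double[] c, int n) {
    final int BLOCK = 64; // tunable: pick so a few tiles fit comfortably in cache
    for (int ii = 0; ii < n; ii += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
            for (int jj = 0; jj < n; jj += BLOCK) {
                int iMax = Math.min(ii + BLOCK, n);
                int kMax = Math.min(kk + BLOCK, n);
                int jMax = Math.min(jj + BLOCK, n);
                for (int i = ii; i < iMax; i++) {
                    for (int k = kk; k < kMax; k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < jMax; j++) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}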