9

Assume I have a large array of relatively small objects, which I need to iterate over frequently.
I would like to optimize my iteration by improving cache performance, so I would like to allocate the objects [and not the references] contiguously in memory; that way I'll get fewer cache misses, and the overall performance could be significantly better.

In C++, I could just allocate an array of the objects and they would be laid out the way I want, but in Java, allocating an array only allocates the references, and the objects themselves are allocated one at a time.
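
For illustration, here is roughly the kind of loop I mean (the class and field names are just a made-up example):

class Particle {
    double x, y;   // a relatively small object
    int charge;
}

class Iterate {
    static double sumX(Particle[] particles) {
        double sum = 0;
        // each array element is only a reference; the Particle it points to
        // may live anywhere on the heap, so every access can miss the cache
        for (Particle p : particles)
            sum += p.x;
        return sum;
    }
}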

I am aware that if I allocate the objects "at once" [one after the other], the JVM will most likely allocate them as contiguously as it can, but that might not be enough if the memory is fragmented.

My questions:

  1. Is there a way to tell the JVM to defragment the memory just before I start allocating my objects? Will that be enough to ensure [as far as possible] that the objects are allocated contiguously?
  2. Is there a different solution to this issue?
Peter Lawrey
amit
  • Do you mean *contiguously*? I suspect so - it may be worth updating your question. – Jon Skeet Mar 09 '12 at 10:14
  • You might want to check [this SO question](http://stackoverflow.com/questions/5198271/large-array-allocation-across-young-and-tenured-portions-of-java-heap) – Alexander Pavlov Mar 09 '12 at 10:15

3 Answers

16

New objects are created in the Eden space. The Eden space is never fragmented; it is always empty after a GC.

The problem you have is that when a GC is performed, objects can be rearranged in memory, or even, surprisingly, end up in the reverse of the order in which they are referenced.

A workaround is to store the fields as a series of arrays. I call this a column-based table instead of a row-based table.

e.g. Instead of writing

class PointCount {
    double x, y;
    int count;
}

PointCount[] pc = ...; // an array of lots of small objects

use column-based data types.

class PointCounts {
    double[] xs, ys;
    int[] counts;
}

or, using a primitive collection library such as GNU Trove,

class PointCounts {
    TDoubleArrayList xs, ys;
    TIntArrayList counts;
}

The arrays themselves could be in up to three different places, but otherwise the data is always contiguous. This can even be marginally more efficient if you perform operations on only a subset of the fields.

public int totalCount() {
    int sum = 0;
    // the counts are contiguous, with nothing between the values
    for (int count : counts) sum += count;
    return sum;
}

A solution I use to avoid GC overhead when I have large amounts of data is to use an interface to access a direct or memory-mapped ByteBuffer

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class MyCounters {
    public static void main(String... args) {
        Runtime rt = Runtime.getRuntime();
        long used1 = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        int length = 100 * 1000 * 1000;
        PointCount pc = new PointCountImpl(length);
        for (int i = 0; i < length; i++) {
            pc.index(i);
            pc.setX(i);
            pc.setY(-i);
            pc.setCount(1);
        }
        for (int i = 0; i < length; i++) {
            pc.index(i);
            if (pc.getX() != i) throw new AssertionError();
            if (pc.getY() != -i) throw new AssertionError();
            if (pc.getCount() != 1) throw new AssertionError();
        }
        long time = System.nanoTime() - start;
        long used2 = rt.totalMemory() - rt.freeMemory();
        System.out.printf("Creating an array of %,d used %,d bytes of heap and took %.1f seconds to set and get%n",
                length, (used2 - used1), time / 1e9);
    }
}

interface PointCount {
    // set the index of the element referred to.
    public void index(int index);

    public double getX();

    public void setX(double x);

    public double getY();

    public void setY(double y);

    public int getCount();

    public void setCount(int count);

    public void incrementCount();
}

class PointCountImpl implements PointCount {
    static final int X_OFFSET = 0;
    static final int Y_OFFSET = X_OFFSET + 8;
    static final int COUNT_OFFSET = Y_OFFSET + 8;
    static final int LENGTH = COUNT_OFFSET + 4;

    final ByteBuffer buffer;
    int start = 0;

    PointCountImpl(int count) {
        this(ByteBuffer.allocateDirect(count * LENGTH).order(ByteOrder.nativeOrder()));
    }

    PointCountImpl(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    @Override
    public void index(int index) {
        start = index * LENGTH;
    }

    @Override
    public double getX() {
        return buffer.getDouble(start + X_OFFSET);
    }

    @Override
    public void setX(double x) {
        buffer.putDouble(start + X_OFFSET, x);
    }

    @Override
    public double getY() {
        return buffer.getDouble(start + Y_OFFSET);
    }

    @Override
    public void setY(double y) {
        buffer.putDouble(start + Y_OFFSET, y);
    }

    @Override
    public int getCount() {
        return buffer.getInt(start + COUNT_OFFSET);
    }

    @Override
    public void setCount(int count) {
        buffer.putInt(start + COUNT_OFFSET, count);
    }

    @Override
    public void incrementCount() {
        setCount(getCount() + 1);
    }
}

Run with the -XX:-UseTLAB option (to get accurate memory allocation sizes), this prints

Creating an array of 100,000,000 used 12,512 bytes of heap and took 1.8 seconds to set and get

As it's off heap, it has next to no GC impact.
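
Because the accessor is only a view over a ByteBuffer, the same PointCountImpl can also be backed by a memory-mapped file, which gives you persistence and lets several processes share the data (see the comments below). A minimal sketch under that assumption - the file name and record count are made up:

import java.io.RandomAccessFile;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;

public class MappedCounters {
    public static void main(String... args) throws Exception {
        int count = 1000; // hypothetical number of records
        long size = (long) count * PointCountImpl.LENGTH;
        try (RandomAccessFile raf = new RandomAccessFile("points.dat", "rw")) {
            // map the file and wrap it in the same accessor as before
            PointCount pc = new PointCountImpl(
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size)
                       .order(ByteOrder.nativeOrder()));
            pc.index(0);
            pc.setX(1.0);
            pc.setY(-1.0);
            pc.setCount(1);
            // the data now lives in points.dat and can be mapped by another process
        }
    }
}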

Jakub Arnold
Peter Lawrey
  • I +1'ed this answer anyway because of the explanation about question (1). The solution suggested is indeed useful, but might fail sometimes, if we also need the specific object for some reason [maintaining a set while searching for dupes, for example] - but nevertheless, it is a nice approach that I will consider when dealing with this issue. Thanks. – amit Mar 09 '12 at 11:16
  • I have added a more complex solution I use to solve this. I also use a memory-mapped file for persistence and sharing of the data between processes. – Peter Lawrey Mar 09 '12 at 12:35
2

Sadly, there is no way of ensuring that objects are created at, or stay in, adjacent memory locations in Java.

However, objects created in sequence will most likely end up adjacent to each other (of course this depends on the actual VM implementation). I'm pretty sure that the writers of the VM are aware that locality is highly desirable and don't go out of their way to scatter objects randomly around.

The Garbage Collector will at some point probably move the objects - if your objects are short-lived, that should not be an issue. For long-lived objects it then depends on how the GC implements moving the survivor objects. Again, I think it's reasonable to assume that the people writing the GC have given the matter some thought and will perform copies in a way that does not hurt locality more than is unavoidable.

There are obviously no guarantees for any of the above assumptions, but since we can't do anything about it anyway, stop worrying :)

The only thing you can do at the Java source level is to sometimes avoid composition of objects - instead you can "inline" the state you would normally put in a composite object:

class MyThing {
    int myVar;
    // ... more members

    // composite object
    Rectangle bounds;
}

instead:

class MyThing {
    int myVar;
    // ... more members

    // "inlined" rectangle
    int x, y, width, height;
}

Of course this makes the code less readable and potentially duplicates a lot of code.

Ordering class members by access pattern seems to have a slight effect (I noticed a slight alteration in a benchmarked piece of code after I had reordered some declarations), but I've never bothered to verify if it's true. It would make sense, though, if the VM does not reorder members.

On the same topic it would also be nice (from a performance point of view) to be able to reinterpret an existing primitive array as another type (e.g. cast int[] to float[]). And while you're at it, why not wish for union members as well? I sure do. But we'd have to give up a lot of platform and architecture independence in exchange for these possibilities.
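
For what it's worth, the closest you can get today is per-element bit reinterpretation, or keeping the data in a ByteBuffer and taking int and float views over the same bytes; a small sketch (note this is not a true zero-copy cast of an existing array):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Reinterpret {
    public static void main(String... args) {
        // per-element reinterpretation of an existing int[] (copies the values)
        int[] ints = { 0x3F800000, 0x40000000 }; // bit patterns of 1.0f and 2.0f
        for (int bits : ints)
            System.out.println(Float.intBitsToFloat(bits)); // prints 1.0 and 2.0

        // int and float views over the same bytes behave much like a union
        ByteBuffer bytes = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder());
        bytes.asFloatBuffer().put(0, 1.0f);
        System.out.println(Integer.toHexString(bytes.asIntBuffer().get(0))); // 3f800000
    }
}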

Durandal
-4

Doesn't work that way in Java. Iteration is not a matter of increasing a pointer. There is no performance impact based on where on the heap the objects are physically stored.

If you still want to approach this in a C/C++ way, think of a Java array as an array of pointers to structs. When you loop over the array, it doesn't matter where the actual structs are allocated, you are looping over an array of pointers.

I would abandon this line of reasoning. It's not how Java works and it's also sub-optimization.

pap
  • `There is no performance impact based on where on the heap the objects are physically stored.` **This is a hardware issue**, not related to Java or any programming language at all; an access to the L1 cache can take far less time than accessing RAM [even 100 times less]. And when you access contiguous addresses, you get fewer cache misses. Also, this "sub-optimization" can double or even triple the speed of an iteration - for a large array of small objects - so it is not "sub-optimization". – amit Mar 09 '12 at 10:26
  • Java is hardware agnostic, so you cannot guarantee an L1 cache (you are assuming a block memory transfer, which is reasonable, but depends on hardware). If you want to do this optimization, you probably want to make a C call, but you lose write-once-run-anywhere. – Luis Mar 09 '12 at 10:33
  • @amit - I realize it's a low-level issue, I'm just pointing out that in Java, you don't have that type of control. If your requirements are such that this level of optimization is necessary, I would consider a different platform than Java and the JVM. If your application will stand or fall on L1 cache misses, you need to consider more low-level implementations to get that kind of control. If not, I maintain it's sub-optimization in that it will cost more in maintenance than it delivers in performance. – pap Mar 09 '12 at 11:22
  • I don't think it has to do with high-level vs low-level. C#, arguably a higher-level language than Java, allows you to solve this problem using value types, giving you contiguous memory blocks. There's no reason why Java couldn't give you this amount of control. It just doesn't, and it's a fair question whether there are ways to deal with this. – JulianR Mar 09 '12 at 22:11