
Observe the following program written in Java (the complete runnable version follows; the important part of the program is repeated in a snippet a bit further below):

import java.util.ArrayList;



/** A not easy to explain benchmark.
 */
class MultiVolatileJavaExperiment {

    public static void main(String[] args) {
        (new MultiVolatileJavaExperiment()).mainMethod(args);
    }

    int size = Integer.parseInt(System.getProperty("size"));
    int par = Integer.parseInt(System.getProperty("par"));

    public void mainMethod(String[] args) {
        int times = 0;
        if (args.length == 0) times = 1;
        else times = Integer.parseInt(args[0]);
        ArrayList<Long> measurements = new ArrayList<Long>();

        for (int i = 0; i < times; i++) {
            long start = System.currentTimeMillis();
            run();
            long end = System.currentTimeMillis();

            long time = (end - start);
            System.out.println(i + ") Running time: " + time + " ms");
            measurements.add(time);
        }

        System.out.println(">>>");
        System.out.println(">>> All running times: " + measurements);
        System.out.println(">>>");
    }

    public void run() {
        int sz = size / par;
        ArrayList<Thread> threads = new ArrayList<Thread>();

        for (int i = 0; i < par; i++) {
            threads.add(new Reader(sz));
            threads.get(i).start();
        }
        for (int i = 0; i < par; i++) {
            try {
                threads.get(i).join();
            } catch (Exception e) {}
        }
    }

    final class Foo {
        int x = 0;
    }

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

}

Explanation: The program is actually very simple. It loads the integers size and par from the system properties (passed to the JVM with the -D flag) - these are the input length and the number of threads to use later. It then parses the first command-line argument, which says how many times to repeat the program (we want to be sure that the JIT has done its work and to get more reliable measurements).

The run method is called in each repetition. This method simply starts par threads, each of which will do a loop with size / par iterations. The thread body is defined in the Reader class. Each iteration of the loop reads the volatile member vfoo and assigns 1 to its field x. After that, vfoo is read once again and assigned to the non-volatile field bar.

Notice that most of the time the program is executing the loop body, so the run method of the thread is the focus of this benchmark:

    final class Reader extends Thread {
        volatile Foo vfoo = new Foo();
        Foo bar = null;
        int sz;

        public Reader(int _sz) {
            sz = _sz;
        }

        public void run() {
            int i = 0;
            while (i < sz) {
                vfoo.x = 1;
                // with the following line commented
                // the scalability is almost linear
                bar = vfoo; // <- makes benchmark 2x slower for 2 processors - why?
                i++;
            }
        }
    }

Observations: Running java -Xmx512m -Xms512m -server -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10 on the following machine:

Ubuntu Server 10.04.3 LTS
8 core Intel(R) Xeon(R) CPU  X5355  @2.66GHz
~20GB ram
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

I get the following times:

>>> All running times: [821, 750, 1011, 750, 758, 755, 1219, 751, 751, 1012]

Now, setting -Dpar=2, I get:

>>> All running times: [1618, 380, 1476, 1245, 1390, 1391, 1445, 1393, 1511, 1508]

Apparently, this doesn't scale for some reason - I would have expected the second output to be twice as fast (although it does seem to be in one of the early iterations - 380ms).

Interestingly, commenting out the line bar = vfoo (which isn't even supposed to be a volatile write) yields the following times for -Dpar set to 1, 2, 4 and 8:

>>> All running times: [762, 563, 563, 563, 563, 563, 570, 566, 563, 563]
>>> All running times: [387, 287, 285, 284, 283, 281, 282, 282, 281, 282]
>>> All running times: [204, 146, 143, 142, 141, 141, 141, 141, 141, 141]
>>> All running times: [120, 78, 74, 74, 81, 75, 73, 73, 72, 71]

It scales perfectly.

Analysis: First of all, there are no garbage collection cycles occurring here (I've added -verbose:gc as well to check this).

I get similar results on my iMac.

Each thread is writing to its own field, and different Foo object instances belonging to different threads don't appear to be ending up in the same cache lines - adding more members into Foo to increase its size doesn't change the measurements. Each thread object instance has more than enough fields to fill up an L1 cache line. So this probably isn't a memory issue.
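
For reference, here is a minimal sketch of the padding experiment mentioned above; the exact number of filler fields needed is a guess (it depends on the JVM's field layout and assumes a 64-byte cache line):

final class Foo {
    int x = 0;
    // Hypothetical padding: with 8-byte longs, seven filler fields plus the
    // object header and x should push each Foo instance past 64 bytes, so two
    // Foo instances should no longer share a cache line.
    long p1, p2, p3, p4, p5, p6, p7;
}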

My next thought was that the JIT might be doing something weird, because the early iterations usually do scale as expected in the uncommented version, so I checked this out by printing the assembly (see this post on how to do that).

java -Xmx512m -Xms512m -server -XX:CompileCommand=print,*Reader.run -Dsize=500000000 -Dpar=1 MultiVolatileJavaExperiment 10

and I get these two outputs for the two versions of the JIT-compiled run method in Reader. The version with bar = vfoo commented out (properly scalable):

[Verified Entry Point]
  0xf36c9fac: mov    %eax,-0x3000(%esp)
  0xf36c9fb3: push   %ebp
  0xf36c9fb4: sub    $0x8,%esp
  0xf36c9fba: mov    0x68(%ecx),%ebx
  0xf36c9fbd: test   %ebx,%ebx
  0xf36c9fbf: jle    0xf36c9fec
  0xf36c9fc1: xor    %ebx,%ebx
  0xf36c9fc3: nopw   0x0(%eax,%eax,1)
  0xf36c9fcc: xchg   %ax,%ax
  0xf36c9fd0: mov    0x6c(%ecx),%ebp
  0xf36c9fd3: test   %ebp,%ebp
  0xf36c9fd5: je     0xf36c9ff7
  0xf36c9fd7: movl   $0x1,0x8(%ebp)

---------------------------------------------

  0xf36c9fde: mov    0x68(%ecx),%ebp
  0xf36c9fe1: inc    %ebx               ; OopMap{ecx=Oop off=66}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@21 (line 83)

---------------------------------------------

  0xf36c9fe2: test   %edi,0xf7725000    ;   {poll}
  0xf36c9fe8: cmp    %ebp,%ebx
  0xf36c9fea: jl     0xf36c9fd0
  0xf36c9fec: add    $0x8,%esp
  0xf36c9fef: pop    %ebp
  0xf36c9ff0: test   %eax,0xf7725000    ;   {poll_return}
  0xf36c9ff6: ret    
  0xf36c9ff7: mov    $0xfffffff6,%ecx
  0xf36c9ffc: xchg   %ax,%ax
  0xf36c9fff: call   0xf36a56a0         ; OopMap{off=100}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf36ca004: call   0xf6f877a0         ;   {runtime_call}

The uncommented bar = vfoo (non-scalable, slower) version:

[Verified Entry Point]
  0xf3771aac: mov    %eax,-0x3000(%esp)
  0xf3771ab3: push   %ebp
  0xf3771ab4: sub    $0x8,%esp
  0xf3771aba: mov    0x68(%ecx),%ebx
  0xf3771abd: test   %ebx,%ebx
  0xf3771abf: jle    0xf3771afe
  0xf3771ac1: xor    %ebx,%ebx
  0xf3771ac3: nopw   0x0(%eax,%eax,1)
  0xf3771acc: xchg   %ax,%ax
  0xf3771ad0: mov    0x6c(%ecx),%ebp
  0xf3771ad3: test   %ebp,%ebp
  0xf3771ad5: je     0xf3771b09
  0xf3771ad7: movl   $0x1,0x8(%ebp)

-------------------------------------------------

  0xf3771ade: mov    0x6c(%ecx),%ebp
  0xf3771ae1: mov    %ebp,0x70(%ecx)
  0xf3771ae4: mov    0x68(%ecx),%edi
  0xf3771ae7: inc    %ebx
  0xf3771ae8: mov    %ecx,%eax
  0xf3771aea: shr    $0x9,%eax
  0xf3771aed: movb   $0x0,-0x3113c300(%eax)  ; OopMap{ecx=Oop off=84}
                                        ;*goto
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@29 (line 83)

-----------------------------------------------

  0xf3771af4: test   %edi,0xf77ce000    ;   {poll}
  0xf3771afa: cmp    %edi,%ebx
  0xf3771afc: jl     0xf3771ad0
  0xf3771afe: add    $0x8,%esp
  0xf3771b01: pop    %ebp
  0xf3771b02: test   %eax,0xf77ce000    ;   {poll_return}
  0xf3771b08: ret    
  0xf3771b09: mov    $0xfffffff6,%ecx
  0xf3771b0e: nop    
  0xf3771b0f: call   0xf374e6a0         ; OopMap{off=116}
                                        ;*putfield x
                                        ; - org.scalapool.bench.MultiVolatileJavaExperiment$Reader::run@15 (line 79)
                                        ;   {runtime_call}
  0xf3771b14: call   0xf70307a0         ;   {runtime_call}

The differences between the two versions lie within the sections marked by the dashed lines. I expected to find synchronization instructions in the assembly which might account for the performance issue - while a few extra shift, mov and inc instructions might affect absolute performance numbers, I don't see how they could affect scalability.

So, I suspect that this is some sort of memory issue related to storing to a field in the class. On the other hand, I'm also inclined to believe that the JIT does something funny, because in one iteration the measured time is twice as fast - as it should be.

Can anyone explain what is going on here? Please be precise and include references that support your claims.

Thank you!

EDIT:

Here's the bytecode for the fast (scalable) version:

public void run();
  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   24
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  iinc    1, 1
   21:  goto    2
   24:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 83: 18
   line 85: 24

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 21 /* same */

The slow (non-scalable) version with bar = vfoo:

public void run();
  Code:
   Stack=2, Locals=2, Args_size=1
   0:   iconst_0
   1:   istore_1
   2:   iload_1
   3:   aload_0
   4:   getfield    #7; //Field sz:I
   7:   if_icmpge   32
   10:  aload_0
   11:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   14:  iconst_1
   15:  putfield    #8; //Field org/scalapool/bench/MultiVolatileJavaExperiment$Foo.x:I
   18:  aload_0
   19:  aload_0
   20:  getfield    #5; //Field vfoo:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   23:  putfield    #6; //Field bar:Lorg/scalapool/bench/MultiVolatileJavaExperiment$Foo;
   26:  iinc    1, 1
   29:  goto    2
   32:  return
  LineNumberTable: 
   line 77: 0
   line 78: 2
   line 79: 10
   line 82: 18
   line 83: 26
   line 85: 32

  StackMapTable: number_of_entries = 2
   frame_type = 252 /* append */
     offset_delta = 2
     locals = [ int ]
   frame_type = 29 /* same */

The more I experiment with this, the more it seems that this has nothing to do with volatiles at all - it has something to do with writing to object fields. My hunch is that this is somehow a memory contention issue - something with caches and false sharing, although there is no explicit synchronization at all.
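
To make the false-sharing hunch concrete, here is a minimal, self-contained sketch (separate from the benchmark above; the slot distances and iteration count are arbitrary choices, and the effect size will vary by machine). Two threads increment longs that sit either next to each other or far apart in the same array:

final class FalseSharingDemo {
    static final long[] counters = new long[32];
    static final long ITERS = 100000000L;

    // Runs two threads, each incrementing its own counter slot, and prints the time.
    static void run(final int slotA, final int slotB) throws InterruptedException {
        Thread a = new Thread() { public void run() { for (long i = 0; i < ITERS; i++) counters[slotA]++; } };
        Thread b = new Thread() { public void run() { for (long i = 0; i < ITERS; i++) counters[slotB]++; } };
        long start = System.currentTimeMillis();
        a.start(); b.start();
        a.join(); b.join();
        System.out.println("slots " + slotA + " and " + slotB + ": "
                + (System.currentTimeMillis() - start) + " ms");
    }

    public static void main(String[] args) throws InterruptedException {
        run(0, 1);   // adjacent longs: almost certainly the same 64-byte cache line
        run(0, 16);  // 128 bytes apart: different cache lines, typically much faster
    }
}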

EDIT 2:

Interestingly, changing the program like this:

final class Holder {
    public Foo bar = null;
}

final class Reader extends Thread {
    volatile Foo vfoo = new Foo();
    Holder holder = null;
    int sz;

    public Reader(int _sz) {
        sz = _sz;
    }

    public void run() {
        int i = 0;
        holder = new Holder();
        while (i < sz) {
            vfoo.x = 1;
            holder.bar = vfoo;
            i++;
        }
    }
}

resolves the scaling issue. Apparently, the Holder object above gets created after the thread is started and is probably allocated in a different region of memory, which is then modified concurrently - as opposed to modifying the field bar in the thread object, which is somehow "close" in memory to the other thread instances.

axel22
  • `bar = vfoo` is slow because it's a volatile read. What times do you get with non-volatile vfoo (as opposed to uncommenting the assignment)? – Viruzzo Jan 19 '12 at 13:22
  • 1) `vfoo.x = 1` is also a volatile read, but it's not slow and it scales well. 2) When `vfoo` is nonvolatile, the JIT optimizes the loop away, unless you add an `if (bar != null)` check inside the loop before the `vfoo.x = 1` to counter its effects. If you do this - make `vfoo` nonvolatile and add this check - the same scalability issue remains. – axel22 Jan 19 '12 at 13:30
  • @axel22: I don't see how `vfoo.x = 1` is a volatile read, you are only writing to a non-volatile field. `bar = vfoo` instead reads a volatile reference. – ninjalj Jan 19 '12 at 21:11
  • Also note that there is another difference: `0xf36c9ffc: xchg %ax,%ax` vs `0xf3771b0e: nop`. In x86, `XCHG` implies a `LOCK`. – ninjalj Jan 19 '12 at 21:13
  • @ninjalj: 1) `vfoo.x = 1` first does a volatile read on the location `vfoo`, then does a write to location `vfoo[x]`. I think the effect should be the same as `Foo tmp = vfoo; tmp.x = 1`. 2) I don't know much about x86 assembly, but I thought that `XCHG` was supposed to be prefixed with a `LOCK` to be locked (atomic). Can you point me to a tutorial or a reference? – axel22 Jan 19 '12 at 21:20
  • @axel22: 1) Can you show the bytecode? (with a bit of luck it will include barriers, so we can make more sense of this). 2) it's common knowledge, there are probably many posts here on SO about it. e.g: http://stackoverflow.com/q/3144335/371250 – ninjalj Jan 19 '12 at 21:48
  • Forget about bytecode, I think **+PrintStubCode** may show something. – ninjalj Jan 19 '12 at 22:01
  • A nice test would be to just remove the `volatile` keyword. If the performance stays the same, then it's surely the write to `bar` the one that is slowing it down. – ninjalj Jan 20 '12 at 19:07
  • I've mentioned it in a comment above - I did remove the `volatile` keyword, and the same performance problem persists. – axel22 Jan 20 '12 at 19:09

5 Answers


This is what I think is happening (keep in mind I'm not familiar with HotSpot):

0xf36c9fd0: mov    0x6c(%ecx),%ebp    ; vfoo
0xf36c9fd3: test   %ebp,%ebp          ; vfoo is null?
0xf36c9fd5: je     0xf36c9ff7         ;   throw NullPointerException (I guess)
0xf36c9fd7: movl   $0x1,0x8(%ebp)     ; vfoo.x = 1
0xf36c9fde: mov    0x68(%ecx),%ebp    ; sz
0xf36c9fe1: inc    %ebx               ; i++
0xf36c9fe2: test   %edi,0xf7725000    ; safepoint on end of loop
0xf36c9fe8: cmp    %ebp,%ebx          ; i < sz?
0xf36c9fea: jl     0xf36c9fd0


0xf3771ad0: mov    0x6c(%ecx),%ebp          ; vfoo
0xf3771ad3: test   %ebp,%ebp                ; vfoo is null?
0xf3771ad5: je     0xf3771b09               ;   throw NullPointerException (I guess)
0xf3771ad7: movl   $0x1,0x8(%ebp)           ; vfoo.x = 1
0xf3771ade: mov    0x6c(%ecx),%ebp          ; \
0xf3771ae1: mov    %ebp,0x70(%ecx)          ; / bar = vfoo
0xf3771ae4: mov    0x68(%ecx),%edi          ; sz
0xf3771ae7: inc    %ebx                     ; i++
0xf3771ae8: mov    %ecx,%eax                ; 
0xf3771aea: shr    $0x9,%eax                ; ??? \ Probably replaced later
0xf3771aed: movb   $0x0,-0x3113c300(%eax)   ; ??? / by some barrier code?
0xf3771af4: test   %edi,0xf77ce000          ; safepoint
0xf3771afa: cmp    %edi,%ebx                ; i < sz ?
0xf3771afc: jl     0xf3771ad0               ;

The reason I think the above code stands in for a barrier is that when taking the NullPointerException, the scalable version has an XCHG, which acts as a barrier, while the non-scalable version has a NOP there.

The rationale would be that there needs to be a happens-before ordering between the initial load of vfoo and joining the thread. In the volatile case, the barrier would be inside the loop, so it wouldn't need to be elsewhere. What I don't understand is why XCHG isn't used inside the loop. Maybe runtime detection of MFENCE support?

ninjalj
  • Apparently, the `shr`/`movb` instruction pair is exactly the barrier code - it sets the card dirty-byte used by the garbage collector. – axel22 Feb 01 '12 at 13:22

Let's try to get the JVM to behave a bit more "consistently." The JIT compiler is really throwing off comparisons of test runs, so let's disable it with -Djava.compiler=NONE. This definitely introduces a performance hit, but it will help eliminate the obscuring effects of JIT compiler optimizations.

Garbage collection introduces its own set of complexities. Let's use the serial garbage collector by using -XX:+UseSerialGC. Let's also disable explicit garbage collections and turn on some logging to see when garbage collection is performed: -verbose:gc -XX:+DisableExplicitGC. Finally, let's get enough heap allocated using -Xmx128m -Xms128m.

Now we can run the test using:

java -XX:+UseSerialGC -verbose:gc -XX:+DisableExplicitGC -Djava.compiler=NONE -Xmx128m -Xms128m -server -Dsize=50000000 -Dpar=1 MultiVolatileJavaExperiment 10

Running the test multiple times shows the results are very consistent (I'm using Oracle Java 1.6.0_24-b07 on Ubuntu 10.04.3 LTS with an Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz), averaging somewhere about 2050 milliseconds. If I comment out the bar = vfoo line, I'm consistently averaging about 1280 milliseconds. Running the test using -Dpar=2 results with an average about 1350 milliseconds with bar = vfoo and about 1005 milliseconds with it commented.

+=========+======+=========+
| Threads | With | Without |
+=========+======+=========+
|    1    | 2050 |  1280   |
+---------+------+---------+
|    2    | 1350 |  1005   |
+=========+======+=========+

Let's now look at the code and see if we can spot any reasons why multi-threading is inefficient. In Reader.run(), qualifying variables with this where appropriate helps make it clear which variables are local:

int i = 0;
while (i < this.sz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}

The first thing to notice is that the while loop contains four field references through this. This means the code is accessing the class's runtime constant pool and performing type checking (via the getfield bytecode instruction). Let's change the code to try to eliminate the runtime constant pool accesses and see if we get any benefit.

final int mysz = this.sz;
int i = 0;
while (i < mysz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}

Here, we're using a local mysz variable to access the loop size and only accessing sz through this once, for initialization. Running the test, with two threads, averages about 1295 milliseconds; a small benefit, but one nonetheless.

Looking at the while loop, do we really need to reference this.vfoo twice? The two volatile reads create two synchronization edges that the virtual machine (and the underlying hardware, for that matter) needs to manage. Let's say we want one synchronization edge at the beginning of the while loop and don't need two; then we can use the following:

final int mysz = this.sz;
Foo myvfoo = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    this.bar = myvfoo;
    i++;
}

This averages about 1122 milliseconds; still getting better. What about that this.bar reference? Since we are talking about multi-threading, let's say the calculations in the while loop are what we want to get multi-threaded benefit from, and this.bar is how we communicate our results to others. We really don't want to set this.bar until after the while loop is done.

final int mysz = this.sz;
Foo myvfoo = null;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
this.bar = mybar;

This gives us about 857 milliseconds on average. There's still that final this.vfoo reference in the while loop. Assuming again that the while loop is what we want multi-threaded benefit from, let's move that this.vfoo out of the while loop.

final int mysz = this.sz;
final Foo myvfoo = this.vfoo;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
final Foo vfoocheck = this.vfoo;
if (vfoocheck != myvfoo) {
    System.out.println("vfoo changed from " + myvfoo + " to " + vfoocheck);
}
this.bar = mybar;

Now we average about 502 milliseconds; single-threaded test averages about 900 milliseconds.

So what does this tell us? By hoisting non-local variable references out of the while loop, we get significant performance benefits in both the single- and double-threaded tests. The original version of MultiVolatileJavaExperiment was measuring the cost of accessing non-local variables 50,000,000 times, while the final version measures the cost of accessing local variables 50,000,000 times. By using local variables, you increase the likelihood that the Java Virtual Machine and the underlying hardware can manage the thread caches more efficiently.

Finally, let's run the tests normally using (note the 500,000,000 loop size instead of 50,000,000):

java -Xmx128m -Xms128m -server -Dsize=500000000 -Dpar=2 MultiVolatileJavaExperiment 10

The original version averages about 1100 milliseconds and the modified version averages about 10 milliseconds.

Go Dan

You are not actually writing to a volatile field so the volatile field can be cached in each thread.

Using volatile prevents some compiler optimisations and in a micro-benchmark, you can see a large relative difference.

In the example above, the commented-out version is longer because the loop has been unrolled to place two iterations in one actual loop. This can almost double performance.

When using volatile you can see there is no loop unrolling.

BTW: You can remove a lot of the code in your example to make it easier to read. ;)

Peter Lawrey
  • Thanks, I've extracted the code a bit. But: 1) caching the volatile field in each thread (in registers) should be an asset to scalability, 2) by removing the volatile, the problem persists, as explained in the comment after the question above, 3) the commented (scalable) version is shorter, not longer, 4) while different lengths can affect performance (and they do - approx. 50% in the 1 thread case), I don't see how it could affect scalability. – axel22 Jan 19 '12 at 13:54
  • Shorter code is easier for people answering the question to read. – Peter Lawrey Jan 19 '12 at 14:09
  • @PeterLawrey: I don't see loop unrolling. – ninjalj Jan 19 '12 at 21:53
  • It's not in the code; the CPU can fill the pipeline with predicted code. If you use a volatile, it stops the pipeline. – Peter Lawrey Jan 19 '12 at 22:07

Short: apparently, the answer is false sharing due to card marking for the GC.
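
A rough sketch of the mechanism, expressed as illustrative Java rather than actual HotSpot code (the 512-byte card size matches the shr $0x9 in the assembly above; the table size, names and missing base offset - visible in the assembly as the -0x3113c300 displacement - are simplifications):

final class CardTableSketch {
    static final int CARD_SHIFT = 9;                   // 2^9 = 512-byte cards
    static final byte[] cardTable = new byte[1 << 20]; // one byte per 512-byte heap chunk

    // What the JIT-emitted write barrier (the shr/movb pair) conceptually does
    // after every reference store such as bar = vfoo:
    static void referenceWriteBarrier(long writtenObjectAddress) {
        // Mark the card covering the written-to object as dirty for the GC.
        // Objects within 512 bytes of each other map to the same card byte,
        // and 64 neighbouring card bytes share one 64-byte cache line, so
        // threads writing references into nearby objects keep bouncing that
        // cache line between cores even though they never touch shared data.
        cardTable[(int) (writtenObjectAddress >>> CARD_SHIFT)] = 0;
    }
}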

A more extensive explanation is given in this question:

Array allocation and access on the Java Virtual Machine and memory contention

axel22

Edit: This answer did not stand up to testing.

I have no way to test this right now (no multicore CPU in this machine), but here is a theory: The Foo instances might not be in the same cache lines, but perhaps the Reader instances are.

This means the slowdown could be explained by the write to bar, rather than the read of vfoo, because writing to bar would invalidate that cache line for the other core and cause lots of copying between caches. Commenting out the write to bar (which is the only write to a field of Reader in the loop) stops the slowdown, which is consistent with this explanation.

Edit: According to this article, the memory layout of objects is such that the bar reference would be the last field in the layout of the Reader object. This means it would probably land in the same cache line as the next object on the heap. Since I am not sure about the order in which new objects are allocated on the heap, I suggested in the comment below padding both "hot" object types with references, which should be effective in separating the objects (at least, I hope it will, but it depends on how fields of the same type are ordered in memory).
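
As a concrete illustration, this is roughly what the padding experiment suggested here (and tried in the comments below, using `Object aX` fields) would look like; the filler field names are taken from those comments, and per those comments the padding did not change the measurements:

final class Reader extends Thread {
    volatile Foo vfoo = new Foo();
    Foo bar = null;
    int sz;
    // Hypothetical padding: 16 unused reference fields (64-128 bytes depending
    // on pointer size), intended to keep the hot fields of different Reader
    // instances out of each other's cache lines.
    Object a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af;

    public Reader(int _sz) {
        sz = _sz;
    }

    public void run() {
        int i = 0;
        while (i < sz) {
            vfoo.x = 1;
            bar = vfoo;
            i++;
        }
    }
}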

Medo42
  • I would go for this theory, but I assumed initially that the `Reader`, being a `Thread`, has a lot of fields. Still, to check this, I've added 16 32-bit integer fields to it just now. The measurements remain exactly the same. So, assuming that object instances occupy contiguous regions of memory, this shouldn't be the reason. Here's a blog post about memory layouts on the JVM that supports this assumption: http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html – axel22 Jan 19 '12 at 14:05
  • Can you try padding both Foo and Reader with some null references (counting 4 or 8 bytes based on your system)? I have a wacky theory there based very much on assumptions, but it might just work. – Medo42 Jan 19 '12 at 14:16
  • I've padded both classes with `Object aX;`, `X` ranging from `0` until `f` (16 refs in total). It didn't change results. – axel22 Jan 19 '12 at 16:46