
I am working with an array declared as `SomeObject[] someObjects = new SomeObject[5000];`.

Each instance in this array is distinct from every other instance, but in expectation they all require the same computational overhead. Moreover, the standard deviation of the distribution of per-instance cost is very narrow: each `someObjects[i]` should take 8 - 12 nanoseconds, mostly clustering around 10 nanoseconds.

What I have noticed is that if I fill this array with a single shared instance `sharedObject = new SomeObject()` and loop over it, the loop runs really fast. For example:

 SomeObject sharedObject = new SomeObject();
 for(int i = 0; i < 5000 ; ++i) {
     someObjects[i] = sharedObject;
 }
 for(int i = 0; i < 5000 ; ++i) {
     someObjects[i].doSomething(); //runs really fast.
 }

But if I fill the array as:

 for(int i = 0; i < 5000 ; ++i) {
     someObjects[i] =  new SomeObject();
 }
 for(int i = 0; i < 5000 ; ++i) {
     someObjects[i].doSomething(); //runs really slow.
 }

then the loop over `someObjects` runs about 8 times more slowly, even though each instance has the same computational overhead as every other (within 1 or 2 nanoseconds) when measured in isolation.

Why does this occur, and how do I prevent this explosion of computational cost?

All computations within each instance of `SomeObject` operate only on that object's internal state.

Note that each `new SomeObject()` occupies only a tiny amount of heap space, so it is not forcing garbage collection.

Here's some code that recreates the problem (though nowhere near as severely as my real-life example):

public class Test2 {

    public final double random;
    public final Double[] sub;

    public Test2(double random) {
        this.random = random;

        sub = new Double[1000];
        for(int i = 0 ; i < 1000 ; ++i) {
            sub[i] = Math.random();
        }
    }

    public void doSomething() {
        for(int i = 0 ; i < 1000 ; ++i) {
            double j = (i - 1000)*5*6/4/4/4*sub[i];
        }
    }

    public static void main(String[] args) {
        Test2 testFirst = new Test2(Math.random());
        Test2[] testFirstArray = new Test2[10000];
        Test2[] testSecondArray = new Test2[10000];

        for(int i = 0 ; i < 10000 ; ++i) {
            testFirstArray[i] = testFirst;
            testSecondArray[i] = new Test2(Math.random());
        }

        //Warmup
        for(int i = 0 ; i < 10000 ; ++i) {
            testFirstArray[i].doSomething();
            testSecondArray[i].doSomething();
        }

        //Time the loop over the array filled with the shared instance
        long firstTimer = System.nanoTime();
        for(int i = 0 ; i < 10000 ; ++i) {
            testFirstArray[i].doSomething();
        }
        System.out.println((System.nanoTime() - firstTimer)/1_000_000.0); //elapsed milliseconds

        //Time the loop over the array of distinct instances
        long secondTimer = System.nanoTime();
        for(int i = 0 ; i < 10000 ; ++i) {
            testSecondArray[i].doSomething();
        }
        System.out.println((System.nanoTime() - secondTimer)/1_000_000.0); //elapsed milliseconds
    }
}
user2763361
  • Well, it's hard to know what JIT is doing, but since in the first example you process 5000 objects and in the second example a single object 5000 times, there's probably optimizations available. Are you benchmarking things correctly? – Kayaman Feb 10 '14 at 08:21
  • @Kayaman I am using `System.nanoTime()`. I can loop to `1000000` and the same effect happens. What optimizations are there? – user2763361 Feb 10 '14 at 08:22
  • "*even though each instance has the same computational overhead as each other (within 1 or 2 nanoseconds)*" => to measure such short computations, you need to be very carfeful in preparing your micro-benchmark. My guess is that you are not measuring properly. – assylias Feb 10 '14 at 08:30
  • @assylias I can scale up a lot and the effect remains (e.g. make the internal computations more rigorous/long) so that each gives 1 us. – user2763361 Feb 10 '14 at 08:31
  • Benchmarking is a sensitive art, so unless you actually know what you're doing, you'll end up with weird results due to things like the JIT kicking in etc. As a guess, at least one of the optimizations happening is that the single object is cached. – Kayaman Feb 10 '14 at 08:34
  • @user2763361 [So many things can go wrong when micro-benchmarking](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java)... Until you show a complete example that reproduces the behaviour, the answers will only be based on speculation. – assylias Feb 10 '14 at 08:49
  • @user2763361 cache locality sounds like a plausible explanation. If you are on linux you can use [`perf`](https://perf.wiki.kernel.org/index.php/Tutorial) to check if it is the case. See for example: http://stackoverflow.com/questions/10082517/simplest-tool-to-measure-c-program-cache-hit-miss-and-cpu-time-in-linux – assylias Feb 10 '14 at 09:29
  • The `double j =` line in doSomething doesn't do anything. It has no side effects. As a result, the JIT is going to optimize it out. Put the result in an array or something. It may help. If something is faster than it should be, then it's probably not happening at all. – allprog Feb 10 '14 at 09:31
  • @allprog it's just an example. My `SomeObject` is different (and proprietary, hence can't post). – user2763361 Feb 10 '14 at 09:31
  • Ok, it's a bad example then. :) – allprog Feb 10 '14 at 09:32
  • Read my first comment. – allprog Feb 10 '14 at 09:33
  • @allprog Why does it matter that it doesn't do anything. I recreated the much higher latency. – user2763361 Feb 10 '14 at 09:33
  • This is not a good forum to discuss such questions. You should go ahead and read about the internals of the JVM. It is an incredibly complex piece of software, and if you want to benchmark it, you need to do it wisely. Brute force is highly likely to produce unpredictable results. This topic has lots of literature, so you should familiarize yourself with that first. – allprog Feb 10 '14 at 09:37

2 Answers


Basically, what you are doing is known as micro-benchmarking. HotSpot starts by running your program with an interpreter. When it discovers that some method is "hot" -- that is, executed a lot, either because it is called a lot or because it contains loops that loop a lot -- it sends that method off to be compiled. After that, one of two things will happen: either the next time the method is called the compiled version will be invoked (instead of the interpreted version), or the currently running loop will be replaced, while still running, with the compiled method.

Go through the following for details.
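A minimal JMH sketch of how the two cases could be compared more reliably. It assumes doSomething() is changed to return its accumulated result (so the JIT cannot eliminate the work as dead code); the class and method names below are only illustrative:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class SharedVsDistinctBenchmark {

    private Test2[] sharedInstance;    // 10,000 slots all pointing at one object
    private Test2[] distinctInstances; // 10,000 separately allocated objects

    @Setup
    public void setup() {
        Test2 one = new Test2(Math.random());
        sharedInstance = new Test2[10_000];
        distinctInstances = new Test2[10_000];
        for (int i = 0; i < 10_000; i++) {
            sharedInstance[i] = one;
            distinctInstances[i] = new Test2(Math.random());
        }
    }

    @Benchmark
    public double shared() {
        double sum = 0;
        for (Test2 t : sharedInstance) {
            sum += t.doSomething(); // assumes doSomething() now returns a double
        }
        return sum; // returning the sum prevents dead-code elimination
    }

    @Benchmark
    public double distinct() {
        double sum = 0;
        for (Test2 t : distinctInstances) {
            sum += t.doSomething();
        }
        return sum;
    }
}

JMH handles warmup iterations and forking for you, so the comparison is not distorted by interpretation, on-stack replacement or compilation happening mid-measurement.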

Jabir

You did not show where you put System.nanoTime(). If the timed range includes the first loop, then the time to create the 5000 new objects is added as well.

Also, iterating over distinct objects requires loading each object from main memory, while in the first case the same object is served from the processor cache.
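One possible mitigation is sketched below. The FlatTest2 class is hypothetical (not part of the original code) and assumes the per-object state really can be flattened into a single contiguous primitive array:

// Hypothetical alternative layout: all per-"object" state lives in one
// contiguous double[], so the hot loop walks memory sequentially instead of
// chasing references to thousands of separately allocated objects.
public class FlatTest2 {

    private final int width;      // elements of state per logical "object"
    private final double[] data;  // count * width doubles, laid out back to back

    public FlatTest2(int count, int width) {
        this.width = width;
        this.data = new double[count * width];
        for (int i = 0; i < data.length; ++i) {
            data[i] = Math.random();
        }
    }

    // Same arithmetic as Test2.doSomething(), but reading from the flat array
    // and returning the result so the work cannot be optimized away.
    public double doSomething(int object) {
        int base = object * width;
        double sum = 0;
        for (int i = 0; i < width; ++i) {
            sum += (i - width)*5*6/4/4/4*data[base + i];
        }
        return sum;
    }
}

A single new FlatTest2(10000, 1000) would replace the 10,000 Test2 instances, and the timing loop becomes for (int k = 0; k < 10000; ++k) flat.doSomething(k);. Sequential reads over one double[] are far friendlier to the hardware prefetcher than pointer-chasing through 10,000 scattered objects, each of which in the posted example also holds 1000 boxed Double elements.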

Alexei Kaigorodov
  • `System.nanoTime()` does not include instantiation. Only the for loop with the `doSomething`. – user2763361 Feb 10 '14 at 08:33
  • How would I test the CPU cache hypothesis? – user2763361 Feb 10 '14 at 08:46
  • Declare all fields in SomeObject as volatile, so that they are not taken from the cache but from main memory. Both variants should then run in roughly the same time. – Alexei Kaigorodov Feb 10 '14 at 08:51
  • @AlexeiKaigorodov: No, they'll be taken from the cache anyway. Just the CPU will have to ensure that the cache is in sync with main memory, but this is cheap when not contended. The times will still differ. – maaartinus Feb 10 '14 at 11:15
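As a crude, non-rigorous check of the cache-locality hypothesis discussed in these comments, one could compare iterating the same distinct-object array in allocation order and in a shuffled order. The CacheHypothesisCheck class below is hypothetical, and it assumes doSomething() does observable work (e.g. returns a value that is used) so the JIT cannot discard it:

import java.util.Arrays;
import java.util.Collections;

public class CacheHypothesisCheck {

    public static void main(String[] args) {
        // Allocate 10,000 distinct objects back to back; with TLAB bump-pointer
        // allocation they usually end up roughly contiguous in memory.
        Test2[] allocationOrder = new Test2[10_000];
        for (int i = 0; i < allocationOrder.length; ++i) {
            allocationOrder[i] = new Test2(Math.random());
        }

        // The same objects, visited in a random order (worse spatial locality).
        Test2[] randomOrder = allocationOrder.clone();
        Collections.shuffle(Arrays.asList(randomOrder));

        // Warm up both paths so doSomething() gets JIT-compiled first.
        time(allocationOrder);
        time(randomOrder);

        System.out.println("allocation order: " + time(allocationOrder) + " ms");
        System.out.println("random order:     " + time(randomOrder) + " ms");
    }

    private static double time(Test2[] objects) {
        long start = System.nanoTime();
        for (Test2 t : objects) {
            t.doSomething();
        }
        return (System.nanoTime() - start) / 1_000_000.0;
    }
}

If the random-order loop is noticeably slower than the allocation-order loop over the very same objects, cache misses rather than the work inside doSomething() are the likely cause.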