Why does a String-casting loop seem to have a static overhead?

Question

Background

I've been running a code (posted at the bottom) to measure performance of explicit Java downcasting, and I've run into what I feel like is a bit of an anomaly... or perhaps two anomalies.

I have already looked at this thread on Java casting overhead, but it seemed to only talk about casting in general, rather than this particular phenomenon. This thread covered similar topics, and I don't really need advice on premature optimization--I'm tuning parts of my application for maximum performance, so this is the logical step.

The tests

I basically wanted to test the performance of downcasting versus the .toString() method for objects which were Strings, but were typed as Objects. So, I created a String a and an Object b with equivalent content, ran three loops, and timed them.

Loop 1 is ((String) b).toLowerCase();
Loop 2 is b.toString().toLowerCase();
and Loop 3 is a.toLowerCase().

Test results

(Measurements in milliseconds.)

   iters   |  Test Round  |  Loop 1  |  Loop 2  |  Loop 3
-----------|--------------|----------|----------|----------
50,000,000 |      1       |   3367   |   3166   |   3186
  Test A   |      2       |   3543   |   3158   |   3156
           |      3       |   3365   |   3155   |   3169
-----------|--------------|----------|----------|----------
 5,000,000 |      1       |    373   |    348   |    369
  Test B   |      2       |    373   |    348   |    370
           |      3       |    399   |    334   |    371
-----------|--------------|----------|----------|----------
  500,000  |      1       |    66    |    36    |    33
  Test C   |      2       |    71    |    36    |    41
           |      3       |    66    |    35    |    34
-----------|--------------|----------|----------|----------
  50,000   |      1       |    27    |     5    |     5
  Test D   |      2       |    27    |     6    |     5
           |      3       |    26    |     5    |     5
-----------|--------------|----------|----------|----------

Code used for tests

long t, iters = ...;

String a = "String", c;
Object b = "String";

t = System.currentTimeMillis();
for (int i = 0; i < iters; i++) {
    c = ((String) b).toLowerCase();
}
System.out.println(System.currentTimeMillis() - t);

t = System.currentTimeMillis();
for (int i = 0; i < iters; i++) {
    c = b.toString().toLowerCase();
}
System.out.println(System.currentTimeMillis() - t);

t = System.currentTimeMillis();
for (int i = 0; i < iters; i++) {
    c = a.toLowerCase();
}
System.out.println(System.currentTimeMillis() - t);

Finally, the question

What I find most fascinating is that loop 2 (.toString()) seemed to perform the best of the three (particularly in Test B)--which doesn't make intuitive sense. Why would calling .toString() be faster than already having a String object?

The other thing that bothers me is that it does not scale. If we compare Tests A and D, they are off by a factor of 9 when compared against each other (27 * 1000 = 27000, not 3000); why is there this huge discrepancy as the number of iterations scales?

Can anyone offer an explanation as to why these two anomalies have proven to be true?

The (weird) reality

Update: Upon advice of Bruno Reis's solution, I ran my benchmarks again with some compiler output. The first loop was crammed with initialization stuff, so I put in a "junk" loop to do this. Once that was done, the results were closer to expected.

Here is a full output from the console using 5,000,000 iterations (commented by me):

     50    1             java.lang.String::toLowerCase (472 bytes)
     50    2             java.lang.CharacterData::of (120 bytes)
     53    3             java.lang.CharacterDataLatin1::getProperties (11 bytes)
     53    4             java.lang.Character::toLowerCase (9 bytes)
     54    5             java.lang.CharacterDataLatin1::toLowerCase (39 bytes)
     67    6     n       java.lang.System::arraycopy (0 bytes)   (static)
     68    7             java.lang.Math::min (11 bytes)
     68    8             java.util.Arrays::copyOfRange (63 bytes)
     69    9             java.lang.String::toLowerCase (8 bytes)
     69   10             java.util.Locale::getDefault (13 bytes)
     70    1 %           Main::main @ 14 (175 bytes)
[GC 49088K->360K(188032K), 0.0007670 secs]
[GC 49448K->360K(188032K), 0.0024814 secs]
[GC 49448K->328K(188032K), 0.0005422 secs]
[GC 49416K->328K(237120K), 0.0007519 secs]
[GC 98504K->352K(237120K), 0.0122388 secs]
[GC 98528K->352K(327552K), 0.0005734 secs]
    595    1 %           Main::main @ -2 (175 bytes)   made not entrant
548 /****** Junk Loop ******/
    597    2 %           Main::main @ 61 (175 bytes)
[GC 196704K->356K(327552K), 0.0008460 secs]
[GC 196708K->388K(523968K), 0.0005100 secs]
343 /****** Loop 1 ******/
    939    2 %           Main::main @ -2 (175 bytes)   made not entrant
    940   11             java.lang.String::toString (2 bytes)
    940    3 %           Main::main @ 103 (175 bytes)
[GC 393092K->356K(523968K), 0.0036496 secs]
377 /****** Loop 2 ******/
   1316    3 %           Main::main @ -2 (175 bytes)   made not entrant
   1317    4 %           Main::main @ 145 (175 bytes)
[GC 393060K->332K(759680K), 0.0008326 secs]
320 /****** Loop 3 ******/

Try running the loops and tests in a different sequence. Any differences? E.g., what happens if you put Loop 1 last in Test D? — DWright, Jan 23 '13 at 03:54
Java is very good at optimization, and loops behavior may not look the same way at all after being compiled. — Déjà vu, Jan 23 '13 at 03:54
@DWright Excellent call. There is definitely some stuff going on in the first loop. — Cat, Jan 23 '13 at 03:58
Yeah, now that I've seen some of the other comments, it may have something to do with the JVM still spinning up during the first loop, regardless of what the loop is? — DWright, Jan 23 '13 at 04:00
@DWright Yup! It's that, for sure. Have a look at the output for yourself. ;) — Cat, Jan 23 '13 at 04:02
This output sounds like `-XX:+PrintCompilation` and `-verbose:gc`, you can see a lot of GC activity (since strings are immutable and many of the calls inside the loops will produce new strings that will be discarded -- although it runs quite fast!) — Bruno Reis, Jan 23 '13 at 04:06
Just `-XX:+PrintCompilation -verbose:gc`. (There are more I could have used, I'm sure.) — Cat, Jan 23 '13 at 04:06
@Eric, you could add some class loading information to the mix, but I doubt you'd get any more relevant information for this benchmark. The classes are likely to be already loaded before the start of the first loop. — Bruno Reis, Jan 23 '13 at 04:08
@BrunoReis You're absolutely right; they all load (except `sun.misc.Cleaner`) before the junk loop even begins. — Cat, Jan 23 '13 at 04:11
Glad it worked. Before I leave, just a hint that might help: usually HotSpot will optimize methods after they've been called 20k (or 10k, depending on client/server vm) times. So, to warmup, generally copying and pasting the code you'd like to benchmark inside a loop that calls your methods 20k times is exactly what you need. :) — Bruno Reis, Jan 23 '13 at 04:15
@BrunoReis That's good to know, I've done a lot of benchmarks in Java... I'm disappointed they were all bull. Thanks for all your help! :) — Cat, Jan 23 '13 at 04:17

score 7 · Accepted Answer · edited May 23 '17 at 12:27

7

The benchmark is flawed, as most questions in SO and elsewhere related to benchmarking Java code. You are measuring much much much more things than you imagine, such as the JIT compiling methods, HotSpot optimizing loops, etc.

Check http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html .

Also, server VM and client VM behave differently (the JVM starts up faster in client but runs slower for some time, because it starts interpreting the bytecode while it gets compiled, while the server VM compiles it before running), etc.

The GC might also be interfering, more so if you get any Full GC during your benchmark (usually Full GCs completely pause every other thread). Even minor collections might have some impact, since they can use quite a bit of CPU to clean up a possibly huge mess created inside the loops.

To do a proper benchmark you should "warm up" the JVM, turn on lots of output from the JVM to be sure about what you are measuring, etc.

Check this question here on SO which addresses how to write benchmarks in Java, including the topics I mentioned above and much more in great detail: How do I write a correct micro-benchmark in Java?

edited May 23 '17 at 12:27

Community

1
1

answered Jan 23 '13 at 03:47

Bruno Reis

37,201
11
119
156

+1 for an interesting link to an interesting article, (and +1 for an interesting question!), but I think it would help if you gave some specific thoughts on how this benchmark instance is flawed. I skimmed the article, and it was not fully apparent to me how it applies here. For example the whole synchronization thing was a mistake in the article that does not apply here. What parts do you think are applicable here? – DWright Jan 23 '13 at 03:51
1

Ok, some of your specifics came through while I was writing the comment. These help. – DWright Jan 23 '13 at 03:52
3

Great answer, and great links. When I re-ran the benchmark with some compiler output, it showed that a bunch of compiler stuff was getting done on the first loop. When I added a junk loop in there, the results are more what I expected. I've edited my initial post with the console output--fascinating stuff. – Cat Jan 23 '13 at 03:57

score 2 · Answer 2 · answered Jan 23 '13 at 04:55

Why would calling .toString() be faster than already having a String object?

Looking at the numbers, I don't see that Loop 2 is consistently faster than Loop3 at all. Indeed, in some cases it is slower. The apparent significant difference in Test B could simply be that the GC runs once more in the Loop 3 case than the Loop 2 case. This could simply be an artefact of the benchmark design.

Anyway, if you really want to find out what (if anything) is going on, you need to look at the native instructions that are generated by the JIT compiler in each case. (There are JVM options to do this ...)