
Original question

I'm running a Java program, say program.jar, with a "small" initial heap (1gb) and a "large" initial heap (16gb). When I run it with the small initial heap, i.e.,

java -jar -Xms1g -Xmx16g program.jar

the program terminates in 12 seconds (averaged over multiple runs). In contrast, when I run it with the large initial heap, i.e.,

java -jar -Xms16g -Xmx16g program.jar

the program terminates in 30 seconds (averaged over multiple runs).

I understand from other questions on SO that, generally, large heaps may give rise to excessive garbage collection, thereby slowing down the program.

However, when I run program.jar with the flag -verbose:gc, no GC activity whatsoever is reported with the large initial heap. With the small initial heap, there is only some GC activity during an initialization phase of the program, before I start measuring time. Excessive garbage collection therefore does not seem to explain my observations.

To make it more confusing (for me at least), I have a functionally equivalent program, say program2.jar, which has the same input-output behavior as program.jar. The main difference is that program2.jar uses less efficient data structures than program.jar, at least in terms of memory (whether program2.jar is also less efficient in terms of time is actually what I'm trying to determine). But regardless of whether I run program2.jar with the small initial heap or the large initial heap, it always terminates in about 22 seconds (including about 2-3 seconds of gc-ing).

So, this is my question: (how) can large heaps slow down programs, excessive gc-ing aside?

(This question may seem similar to Georg's question in "Java slower with big heap", but his problem turned out to be unrelated to the heap. In my case, I feel it must have something to do with the heap, as it's the only difference between the two runs of program.jar.)

Here are some details that may be of relevance. I'm using Java 7, OpenJDK:

> java -version
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

My machine has two E5-2690V3 processors (http://ark.intel.com/products/81713) in two sockets (Hyper-Threading and Turbo Boost disabled) and has ample memory (64gb), about half of which is free just before running the program:

> free
             total       used       free     shared    buffers     cached
Mem:      65588960   31751316   33837644         20     154616   23995164
-/+ buffers/cache:    7601536   57987424
Swap:      1023996      11484    1012512

Finally, the program has multiple threads (around 70).

Edit 1

In response to Bruno Reis and kdgregory

I collected some additional statistics. These are for program.jar with a small initial heap:

Command being timed: "java -Xms1g -Xmx16g -verbose:gc -jar program.jar"
User time (seconds): 339.11
System time (seconds): 29.86
Percent of CPU this job got: 701%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:52.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 12192224
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 771372
Voluntary context switches: 7446
Involuntary context switches: 27788
Swaps: 0
File system inputs: 10216
File system outputs: 120
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

This is for program.jar with a large initial heap:

Command being timed: "java -Xms16g -Xmx16g -verbose:gc -jar program.jar"
User time (seconds): 769.13
System time (seconds): 28.04
Percent of CPU this job got: 1101%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:12.34
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 10974528
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 16
Minor (reclaiming a frame) page faults: 687727
Voluntary context switches: 6769
Involuntary context switches: 68465
Swaps: 0
File system inputs: 2032
File system outputs: 160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

(The wall clock times reported here differ from the ones reported in my original post because of a previously untimed initialization phase.)
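(These statistics come from GNU time, i.e., something like

/usr/bin/time -v java -Xms1g -Xmx16g -verbose:gc -jar program.jar

and the same with -Xms16g for the large initial heap.)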

Edit 2

In response to the8472's initial answer and later comment

I collected some statistics on caches. This is for program.jar with a small initial heap (averaged over 30 runs):

  2719852136 cache-references                                              ( +-  1.56% ) [42.11%]
  1931377514 cache-misses              #   71.010 % of all cache refs      ( +-  0.07% ) [42.11%]
 56748034419 L1-dcache-loads                                               ( +-  1.34% ) [42.12%]
 16334611643 L1-dcache-load-misses     #   28.78% of all L1-dcache hits    ( +-  1.70% ) [42.12%]
 24886806040 L1-dcache-stores                                              ( +-  1.47% ) [42.12%]
  2438414068 L1-dcache-store-misses                                        ( +-  0.19% ) [42.13%]
           0 L1-dcache-prefetch-misses                                    [42.13%]
    23243029 L1-icache-load-misses                                         ( +-  0.66% ) [42.14%]
  2424355365 LLC-loads                                                     ( +-  1.73% ) [42.15%]
   278916135 LLC-stores                                                    ( +-  0.30% ) [42.16%]
   515064030 LLC-prefetches                                                ( +-  0.33% ) [10.54%]
 63395541507 dTLB-loads                                                    ( +-  0.17% ) [15.82%]
  7402222750 dTLB-load-misses          #   11.68% of all dTLB cache hits   ( +-  1.87% ) [21.08%]
 20945323550 dTLB-stores                                                   ( +-  0.69% ) [26.34%]
   294311496 dTLB-store-misses                                             ( +-  0.16% ) [31.60%]
    17012236 iTLB-loads                                                    ( +-  2.10% ) [36.86%]
      473238 iTLB-load-misses          #    2.78% of all iTLB cache hits   ( +-  2.88% ) [42.12%]
 29390940710 branch-loads                                                  ( +-  0.18% ) [42.11%]
    19502228 branch-load-misses                                            ( +-  0.57% ) [42.11%]

53.771209341 seconds time elapsed                                          ( +-  0.42% )

This is for program.jar with a large initial heap (averaged over 30 runs):

 10465831994 cache-references                                              ( +-  3.00% ) [42.10%]
  1921281060 cache-misses              #   18.358 % of all cache refs      ( +-  0.03% ) [42.10%]
 51072650956 L1-dcache-loads                                               ( +-  2.14% ) [42.10%]
 24282459597 L1-dcache-load-misses     #   47.54% of all L1-dcache hits    ( +-  0.16% ) [42.10%]
 21447495598 L1-dcache-stores                                              ( +-  2.46% ) [42.10%]
  2441970496 L1-dcache-store-misses                                        ( +-  0.13% ) [42.10%]
           0 L1-dcache-prefetch-misses                                    [42.11%]
    24953833 L1-icache-load-misses                                         ( +-  0.78% ) [42.12%]
 10234572163 LLC-loads                                                     ( +-  3.09% ) [42.13%]
   240843257 LLC-stores                                                    ( +-  0.17% ) [42.14%]
   462484975 LLC-prefetches                                                ( +-  0.22% ) [10.53%]
 62564723493 dTLB-loads                                                    ( +-  0.10% ) [15.80%]
 12686305321 dTLB-load-misses          #   20.28% of all dTLB cache hits   ( +-  0.01% ) [21.06%]
 19201170089 dTLB-stores                                                   ( +-  1.11% ) [26.33%]
   279236455 dTLB-store-misses                                             ( +-  0.10% ) [31.59%]
    16259758 iTLB-loads                                                    ( +-  4.65% ) [36.85%]
      466127 iTLB-load-misses          #    2.87% of all iTLB cache hits   ( +-  6.68% ) [42.11%]
 28098428012 branch-loads                                                  ( +-  0.13% ) [42.10%]
    18707911 branch-load-misses                                            ( +-  0.82% ) [42.10%]

73.576058909 seconds time elapsed                                          ( +-  0.54% )
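
For reference, these counters were gathered with perf stat over repeated runs, roughly along these lines (the exact event list is reconstructed from the output above, so treat it as a sketch rather than the literal command):

perf stat -r 30 -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses,dTLB-loads,dTLB-load-misses java -Xms1g -Xmx16g -jar program.jar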

Comparing the absolute numbers, the large initial heap results in about 50% more L1-dcache-load-misses and 70% more dTLB-load-misses. I did a back-of-the-envelope calculation for the dTLB-load-misses, assuming 100 cycles/miss (source: Wikipedia) on my 2.6 GHz machine, which gives a 484-second delay for the large initial heap versus a 284-second delay with the small one. I don't know how to translate this number back into a per-core delay (probably not just divide by the number of cores?), but the order of magnitude seems plausible.
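
In formula form, the estimate is simply

estimated delay ≈ dTLB-load-misses × 100 cycles/miss ÷ (2.6 × 10^9 cycles/second)

which, with the miss counts above, gives the roughly 284-second and 484-second figures (these are aggregated over all cores, so they are best read as CPU time rather than wall-clock time).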

After collecting these statistics, I also diff-ed the output of -XX:+PrintFlagsFinal for the large and small initial heap (based on one run for each of these two cases):

<     uintx InitialHeapSize                          := 17179869184     {product}           
---
>     uintx InitialHeapSize                          := 1073741824      {product}  
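
For reference, a comparison along these lines produces such a diff (the filenames and redirection are just an illustration; the program's own output ends up in the same files):

java -Xms16g -Xmx16g -XX:+PrintFlagsFinal -jar program.jar > flags_large.txt
java -Xms1g -Xmx16g -XX:+PrintFlagsFinal -jar program.jar > flags_small.txt
diff flags_large.txt flags_small.txt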

So, no other flags seem affected by -Xms. Here is also the output of -XX:+PrintGCDetails for program.jar with a small initial heap:

[GC [PSYoungGen: 239882K->33488K(306176K)] 764170K->983760K(1271808K), 0.0840630 secs] [Times: user=0.70 sys=0.66, real=0.09 secs]
[Full GC [PSYoungGen: 33488K->0K(306176K)] [ParOldGen: 950272K->753959K(1508352K)] 983760K->753959K(1814528K) [PSPermGen: 2994K->2993K(21504K)], 0.0560900 secs] [Times: user=0.20 sys=0.03, real=0.05 secs]
[GC [PSYoungGen: 234744K->33056K(306176K)] 988704K->983623K(1814528K), 0.0416120 secs] [Times: user=0.69 sys=0.03, real=0.04 secs]
[GC [PSYoungGen: 264198K->33056K(306176K)] 1214765K->1212999K(1814528K), 0.0489600 secs] [Times: user=0.61 sys=0.23, real=0.05 secs]
[Full GC [PSYoungGen: 33056K->0K(306176K)] [ParOldGen: 1179943K->1212700K(2118656K)] 1212999K->1212700K(2424832K) [PSPermGen: 2993K->2993K(21504K)], 0.1589640 secs] [Times: user=2.27 sys=0.10, real=0.16 secs]
[GC [PSYoungGen: 230538K->33056K(431616K)] 1443238K->1442364K(2550272K), 0.0523620 secs] [Times: user=0.69 sys=0.23, real=0.05 secs]
[GC [PSYoungGen: 427431K->33152K(557568K)] 1836740K->1835676K(2676224K), 0.0774750 secs] [Times: user=0.64 sys=0.72, real=0.08 secs]
[Full GC [PSYoungGen: 33152K->0K(557568K)] [ParOldGen: 1802524K->1835328K(2897920K)] 1835676K->1835328K(3455488K) [PSPermGen: 2993K->2993K(21504K)], 0.2019870 secs] [Times: user=2.74 sys=0.13, real=0.20 secs]
[GC [PSYoungGen: 492503K->33280K(647168K)] 2327831K->2327360K(3545088K), 0.0870810 secs] [Times: user=0.61 sys=0.92, real=0.09 secs]
[Full GC [PSYoungGen: 33280K->0K(647168K)] [ParOldGen: 2294080K->2326876K(3603968K)] 2327360K->2326876K(4251136K) [PSPermGen: 2993K->2993K(21504K)], 0.0512730 secs] [Times: user=0.09 sys=0.12, real=0.05 secs]

Heap
 PSYoungGen  total 647168K, used 340719K [0x00000006aaa80000, 0x00000006dd000000, 0x0000000800000000)
  eden space 613376K, 55% used [0x00000006aaa80000,0x00000006bf73bc10,0x00000006d0180000)
  from space 33792K, 0% used [0x00000006d2280000,0x00000006d2280000,0x00000006d4380000)
  to   space 33792K, 0% used [0x00000006d0180000,0x00000006d0180000,0x00000006d2280000)
 ParOldGen   total 3603968K, used 2326876K [0x0000000400000000, 0x00000004dbf80000, 0x00000006aaa80000)
  object space 3603968K, 64% used [0x0000000400000000,0x000000048e0572d8,0x00000004dbf80000)
 PSPermGen   total 21504K, used 3488K [0x00000003f5a00000, 0x00000003f6f00000, 0x0000000400000000)
  object space 21504K, 16% used [0x00000003f5a00000,0x00000003f5d68070,0x00000003f6f00000)

And for program.jar with a large initial heap:

Heap
 PSYoungGen  total 4893696K, used 2840362K [0x00000006aaa80000, 0x0000000800000000, 0x0000000800000000)
  eden space 4194816K, 67% used [0x00000006aaa80000,0x000000075804a920,0x00000007aab00000)
  from space 698880K, 0% used [0x00000007d5580000,0x00000007d5580000,0x0000000800000000)
  to   space 698880K, 0% used [0x00000007aab00000,0x00000007aab00000,0x00000007d5580000)
 ParOldGen   total 11185152K, used 0K [0x00000003fff80000, 0x00000006aaa80000, 0x00000006aaa80000)
  object space 11185152K, 0% used [0x00000003fff80000,0x00000003fff80000,0x00000006aaa80000)
 PSPermGen   total 21504K, used 3489K [0x00000003f5980000, 0x00000003f6e80000, 0x00000003fff80000)
  object space 21504K, 16% used [0x00000003f5980000,0x00000003f5ce8400,0x00000003f6e80000)
Peng
    Do you have any info on the amount of page outs/ins you are getting, during both runs (1GB vs 16GB Xms)? – Bruno Reis Jul 04 '15 at 21:37
  • "no gc-activity is reported whatsoever" seems suspicious; it indicates that all of the data for your program remains in the "eden" section of the heap. Anything that might possibly need a 16 Gb heap would seem to involve at least a few minor collections. – kdgregory Jul 04 '15 at 22:06
  • In addition to what @BrunoReis suggested regarding page faults, I would also look at swap stats and resident set size (RSS). I could imagine a *possible* situation in which the thread-local allocation blocks are spread out and cause more work for the memory subsystem, but (1) 70 threads seems too low for that to be a real issue, and (2) I can't imagine that causing a 3x slowdown unless you're suffering from paging. – kdgregory Jul 04 '15 at 22:11
  • Here's a link that will help you get per-process swap stats: http://stackoverflow.com/a/7180078 – kdgregory Jul 04 '15 at 22:16
  • I added some statistics to the question. Is this sufficient or do you need more? – Peng Jul 04 '15 at 22:32
  • How's this for a guess: heaps are usually single memory blocks, at least on some systems; is it possible that the OS on which you are running this takes the extra time to allocate and deallocate the larger heap, which you then don't use? You could try it with successively smaller heap sizes and see if there's a threshold, or root around for tools to monitor OS memory allocation, or run it on different OSs. – arcy Jul 05 '15 at 00:13
  • CPU time seems to go up a lot even though wall time just goes up a little. How multi-threaded is your program? And file system reads seem to vary too, does it do any disk IO? How many measurements does "multiple runs" entail exactly? – the8472 Jul 05 '15 at 08:26
  • @the8472: I added statistics on caches to the question (as you mentioned in your now-removed answer). Do you think the difference in misses explains the difference in performance? I don't exactly understand what you mean by how multithreaded my program is; could you elaborate on what kind of answer you expect to that question? The program does no disk IO. And I must have run these programs at least 15-20 times by now, with the same results. – Peng Jul 05 '15 at 09:32
  • @arcy: Perhaps. But I suppose the OS would do this allocation of the whole heap already in the beginning, as part of initializing the JVM, before the program actually runs, and before I start measuring the time? – Peng Jul 05 '15 at 09:36
  • Run perf with -r to have multiple runs and see if the differences are statistically significant. If they are print JVM flags with `-XX:+PrintFlagsFinal -XX:+PrintGCDetails` *and diff them* to see if initial heap sizes affect any other parameter. Differences in cache hit rates only make sense if something regarding memory layout changes or GCs happen. – the8472 Jul 05 '15 at 10:42
  • As for multi-threading. The question is whether your algorithm is multi-threaded. CPU time > wall time indicates that more than one core is utilized. This could either be JIT Compiler threads, GC threads or your own code. This would affect how to diagnose the cause. – the8472 Jul 05 '15 at 10:49
  • @the8472: I repeated the measurements thirty times and updated the results in the Update 2 section of my question. I also diff-ed the flags: the only difference is the initial heap size. The gc details are different, though, but I'm not sure how to interpret the numbers. But: do you agree that the difference in cache behavior explains the difference in performance? If so, it seems we're trying to answer a follow-up question at this point, namely: why is cache behavior different? And that would be progress :). – Peng Jul 05 '15 at 12:02
  • @the8472: Thank you for your clarifying your multithreading question. My program is multithreaded, yes (it creates about 70 threads). – Peng Jul 05 '15 at 12:04
  • @Peng I don't know when the allocation is made; how are you measuring the wall time? – arcy Jul 05 '15 at 12:11
  • @arcy: Wall time is reported using `time -v`; my own measurements start after the initialization phase of the program. – Peng Jul 05 '15 at 12:36
  • another thing, why 16GB? your workload seems to fit into fewer gigabytes. Below 4GB or so you might get another performance boost with unscaled [compressed oops](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops). – the8472 Jul 05 '15 at 13:01
  • @the8472: You're right, 16gb is more than I need. I sort of coincidentally ran into this and wanted to understand what I was observing. Hopefully, it's of some use to other people as well. – Peng Jul 05 '15 at 13:19

1 Answer


Accessing memory costs CPU time. Accessing more memory not only costs linearly more CPU time, it also likely increases cache pressure and thus miss rates, costing you super-linearly more CPU time.

Run your program with perf stat java -jar ... to see the number of cache misses; see the Perf tutorial.


[Image: "Example Time Scale of System Latencies". Source: "Systems Performance: Enterprise and the Cloud", Brendan Gregg, ISBN 978-0133390094]


Since the initial heap size also affects the eden space size, and a smaller eden space triggers GCs, this can lead to a more compact heap, which can be more cache-friendly (no temporary start-up objects littering the heap).

To reduce the number of differences between the two runs, try setting the initial and maximum young generation size to the same value for both runs, so that only the old generation size differs. That should - probably - lead to the same performance.
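
For example, something along these lines (the 512m value is purely illustrative; -Xmn sets both the initial and the maximum young generation size):

java -Xmn512m -Xms1g -Xmx16g -jar program.jar
java -Xmn512m -Xms16g -Xmx16g -jar program.jar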

As an aside: you could also try starting the JVM with huge pages; it might (you need to measure!) get you a few extra percent of performance by reducing TLB misses further.
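
A minimal sketch (this assumes the OS actually has huge pages available, e.g. configured via vm.nr_hugepages on Linux; if it does not, HotSpot prints a warning and falls back to regular pages):

java -XX:+UseLargePages -Xms16g -Xmx16g -jar program.jar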

Note to future readers: restricting the new gen size does not necessarily make your JVM faster; it just triggers a GC that happens to make @Peng's particular workload faster.
Manually triggering a GC after startup would have had the same effect.
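
A minimal sketch of that idea (class and method names are placeholders; note that System.gc() is only a request, and it is ignored entirely when the JVM runs with -XX:+DisableExplicitGC):

public final class Program {                        // placeholder name
    public static void main(String[] args) {
        initialize();          // start-up phase that creates many temporary objects
        System.gc();           // request a full GC so only long-lived data remains, packed together
        long start = System.nanoTime();
        runMeasuredWork();     // the phase being timed now runs against a cleaned-up heap
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }

    private static void initialize()      { /* build the long-lived data structures */ }
    private static void runMeasuredWork() { /* the workload being measured */ }
}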

the8472
  • My statement about no gc-activity was not precise enough (now fixed). What I meant was that there's no gc-activity after I start measuring time. For the small initial heap, in fact, there is some gc-activity during the initialization phase of the program. I added the gc output to the question (in Edit 2). I didn't realize that this may be important---sorry. When you mentioned that "everything fits into the young generation", I realized that with the small initial heap, 64% of the old generation seems to be occupied versus 0% with the large initial heap, so compacting may play a role here? – Peng Jul 05 '15 at 12:35
  • *sigh*. yes. very important. Without GCing you're mixing short-lived, temporary objects with long-lived ones on the heap. With GCs you kick out all the short-lived ones and have the long-lived ones side by side, which may make it more cache-friendly. – the8472 Jul 05 '15 at 12:50
  • That's it! If I set the young generation size to the same value, performance of the small and large initial heap becomes comparable. So, to summarize: the difference in performance is explained by the difference in cache behavior, which is explained by GC behavior, which is explained by the young generation size, which is explained by the initial heap size. Thanks a lot for your help :). – Peng Jul 05 '15 at 13:16