
JProfiler shows me that 5M char[] instances take 2.5GB. The text stored in those char[] arrays totals 1.2GB and comes from a file. The remaining 1.3GB seems to be overhead from the array instances: their headers, fields such as length, and any alignment the JVM might do. That still looks like too much. At a higher level, what I keep in memory is a HashMap. JProfiler shows:

5M char[]: 2.5GB
5M String: 126MB
2.6M HashMap$Node: 84MB

Please advise: why would the JVM need so much heap overhead for the char[] instances? Or is JProfiler perhaps unable to report heap consumption per class accurately?

NicuMarasoiu
  • What exactly are you storing in the char[] arrays? What does the distribution of array lengths look like? – PiRocks May 22 '20 at 20:06
  • Better still, can you give us an MRE? – PiRocks May 22 '20 at 20:08
  • @PiRocks 2.5M of them have 411 chars on average, ASCII I would say; the other 2.5M have 5 chars each – NicuMarasoiu May 22 '20 at 20:38
  • @PiRocks what is an MRE? A memory snapshot? – NicuMarasoiu May 22 '20 at 20:39
  • MRE=minimal reproducible example. – PiRocks May 22 '20 at 20:56
  • For small arrays, keep in mind that arrays need to store some kind of class pointer and an array length value. If both of these are 64-bit, that can add a fair bit of overhead. – PiRocks May 22 '20 at 20:58
  • And this might be a long shot, but you are aware that chars are two bytes in Java, right? – PiRocks May 22 '20 at 21:00
  • One final possible explanation that comes to mind is string interning – PiRocks May 22 '20 at 21:00
  • @PiRocks go ahead and give the answer. I had in mind that in UTF-8 the ASCII chars are represented as 1 byte. Now I store byte[] instead of Strings and the memory consumption is cut in half. Yes, the primitive char is two bytes. Good to clarify. – NicuMarasoiu May 22 '20 at 21:35
  • @NicuMarasoiu _or_ you can upgrade to Java 9 or newer, where Latin-1 Strings are stored as `byte[]`. – Eugene May 23 '20 at 01:24
  • As @Eugene said, the newer JDKs have the "Compact Strings" optimization, which should cut down the overhead, at least for ASCII strings (as soon as you have a non-ASCII character in your string, the overhead jumps back). – Juraj Martinka May 23 '20 at 04:06

1 Answer


As stated in the comments, this is because chars in Java are unsigned 2-byte values (UTF-16 code units). Therefore, if you read an ASCII file into a String object, you should expect roughly a 2x overhead.
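As a rough sanity check against the numbers in the question, the arithmetic can be sketched as below. This assumes a 16-byte array header and 8-byte object alignment, which is typical for 64-bit HotSpot with compressed class pointers but is not guaranteed on every JVM. The class and method names are made up for illustration, and the 411/5 lengths are the averages quoted in the comments.

```java
final class CharArrayFootprint {
    // Assumption: 16-byte array header (mark word + class pointer + length field).
    static final long ARRAY_HEADER_BYTES = 16;
    // Assumption: objects are padded to a multiple of 8 bytes.
    static final long ALIGNMENT = 8;

    // Round a size up to the next multiple of ALIGNMENT.
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
    }

    // Estimated heap size of a char[]: header plus 2 bytes per char.
    static long sizeOfCharArray(int length) {
        return align(ARRAY_HEADER_BYTES + 2L * length);
    }

    // Estimated heap size of a byte[]: header plus 1 byte per element.
    static long sizeOfByteArray(int length) {
        return align(ARRAY_HEADER_BYTES + (long) length);
    }

    public static void main(String[] args) {
        long count = 2_500_000L; // 2.5M long arrays and 2.5M short arrays
        long asChars = count * (sizeOfCharArray(411) + sizeOfCharArray(5));
        long asBytes = count * (sizeOfByteArray(411) + sizeOfByteArray(5));
        System.out.printf("char[] estimate: %.2f GB%n", asChars / 1e9);
        System.out.printf("byte[] estimate: %.2f GB%n", asBytes / 1e9);
    }
}
```

Under these assumptions, headers and padding add only on the order of 100MB across 5M arrays, so almost all of the gap between the file size and the reported heap usage comes from storing 2 bytes per character.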

As pointed out by @Eugene and @Juraj Martinka, newer JVMs (Java 9 and later) have optimizations for this. You can see more info on those optimizations here.
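On an older JVM, the workaround mentioned in the comments (keeping the text as byte[] instead of String) might look roughly like the sketch below. `CompactTextStore` and its methods are hypothetical names, and the sketch assumes the text really is ASCII or Latin-1 and that only the map values need compacting, since byte[] does not compare by content and so is awkward as a HashMap key.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class CompactTextStore {
    // Values are kept as one byte per character instead of a 2-byte-per-char String.
    private final Map<String, byte[]> values = new HashMap<>();

    // Encode on write. Characters outside Latin-1 would be replaced with '?',
    // so this is only safe if the input is known to be ASCII/Latin-1.
    void put(String key, String text) {
        values.put(key, text.getBytes(StandardCharsets.ISO_8859_1));
    }

    // Decode on read; the String is created only when it is actually needed.
    String get(String key) {
        byte[] raw = values.get(key);
        return raw == null ? null : new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Number of payload bytes held for a key, or -1 if the key is absent.
    int storedBytes(String key) {
        byte[] raw = values.get(key);
        return raw == null ? -1 : raw.length;
    }
}
```

The trade-off is a decoding cost and a temporary String on every read, in exchange for roughly halving the resident size of the stored text.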

PiRocks