This is Sun JDK 1.6u21, x64.

I have a class for the purpose of experimenting with perm gen usage which contains only a single large string (512k characters):

public class Big0 {
     public String bigString =
         "A string with 2^19 characters, should be 1 MB in size";
}

I check the perm gen usage by calling getUsage().toString() on the MemoryPoolMXBean object for the permanent generation (called "PS Perm Gen" in u21, although the name differs slightly between versions and between garbage collectors).
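A minimal sketch of that measurement, assuming a HotSpot JVM. The class name PermGenProbe and the helper findPool are hypothetical. The name fragments "Perm Gen" and "Metaspace" (for Java 8 and later, where the permanent generation no longer exists) are assumptions about how the JVM names its pools, so check the names your own JVM reports.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

class PermGenProbe {
    // Returns the first memory pool whose name contains the fragment, or null if none matches.
    static MemoryPoolMXBean findPool(String nameFragment) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains(nameFragment)) {
                return pool;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: "Perm Gen" matches names such as "PS Perm Gen" on Java 6/7.
        MemoryPoolMXBean pool = findPool("Perm Gen");
        if (pool == null) {
            // Assumption: newer HotSpot JVMs expose "Metaspace" instead.
            pool = findPool("Metaspace");
        }
        if (pool == null) {
            System.out.println("No matching memory pool found");
            return;
        }
        // getUsage() returns a MemoryUsage snapshot (init, used, committed, max).
        System.out.println(pool.getName() + ": " + pool.getUsage());
    }
}
```

Calling this before and after touching the class under test, and comparing the "used" values, gives the deltas described here.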

When I first reference the class, say by reading Big0.class, perm gen jumps by ~500 KB - that's what I'd expect as the constant pool encoding of the string is UTF-8, and I'm using only ASCII characters.

When I actually create an instance of this class, however, perm gen jumps by ~2 MB. Since this is a 1 MB string in memory (2 bytes per UTF-16 character, certainly no surrogates), I'm confused about why the memory usage is double that.
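The expected sizes can be checked with a short calculation. This is only a sketch: the class name StringFootprint and its helpers are hypothetical, and the numbers cover the character data alone, not object headers or any other JVM bookkeeping. For ASCII text without NUL characters, standard UTF-8 and the class file's modified UTF-8 produce the same bytes, so standard UTF-8 is used here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class StringFootprint {
    // Builds a string consisting of `count` copies of `c`.
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // Bytes needed to encode the string as UTF-8 (1 byte per ASCII character).
    static long utf8Bytes(String s) {
        return s.getBytes(Charset.forName("UTF-8")).length;
    }

    // Bytes needed for the character data as UTF-16 code units (2 bytes per char).
    static long utf16Bytes(String s) {
        return 2L * s.length();
    }

    public static void main(String[] args) {
        String big = repeat('A', 1 << 19);      // 2^19 = 524288 characters
        System.out.println(utf8Bytes(big));     // 524288 bytes, about 512 KB
        System.out.println(utf16Bytes(big));    // 1048576 bytes, 1 MB
    }
}
```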

The same effect occurs if I make the string static. If I use final, compilation fails because I exceed the 65535-byte limit for constant pool items (I'm not sure why leaving final off avoids that; consider that a bonus question).
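The 65535-byte limit comes from the 2-byte length field of a UTF-8 constant pool entry. DataOutputStream.writeUTF uses the same layout (a 2-byte length followed by modified UTF-8), so it can illustrate the boundary. This is a sketch of the format limit only, not of what the compiler does internally; the class name Utf8LimitDemo and its helper are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

class Utf8LimitDemo {
    // Returns true if the string fits in one length-prefixed modified UTF-8 record,
    // i.e. if its encoded form is at most 65535 bytes long.
    static boolean fitsInModifiedUtf8(String s) {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException tooLong) {
            // writeUTF rejects strings whose encoded length exceeds 65535 bytes.
            return false;
        } catch (IOException e) {
            // Not expected when writing to an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String atLimit = new String(new char[65535]).replace('\0', 'A');
        String overLimit = atLimit + "A";
        System.out.println(fitsInModifiedUtf8(atLimit));   // true
        System.out.println(fitsInModifiedUtf8(overLimit)); // false
    }
}
```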

Any insight appreciated!

Edit: I should also point out that this occurs with non-static, final non-static, and static strings, but not for final static strings. Since that's already a best practice for string constants, maybe this is of mostly academic interest.

BeeOnRope
  • What if you ran System.gc() a bunch of times after you create an instance of this class, in an effort to clear out all unnecessary cruft from perm gen? That is, could there be a fleeting temporary footprint in perm gen that leads us to incorrectly conclude there's a higher impact? – Ron Feb 23 '11 at 03:20
  • I did that, no effect unfortunately. I also did the ultimate test: I filled up perm gen. The app OOMed with it full of these 2.5 MB blocks, without recovering any, so we can pretty much assume they cannot be collected in the current implementation. – BeeOnRope Feb 23 '11 at 04:38
  • Is it possible that there are two copies of the string when you make that assignment? One for the literal between the quotes (all strings are immutable) and one stored in "bigString". Because "bigString" has a strong reference to the literal, garbage collection isn't destroying the first copy (the one to the right of the equals sign). The reason why final and static are working is that the compiler is creating a phantom reference. This is low-level stuff for me, so I'm hesitant to post it as an answer. – Jason Sperske Feb 23 '11 at 18:28

4 Answers

I think it's an artefact of your test class. I created a similar class, then decompiled it with javap.

The Java compiler (Eclipse's, in my case) breaks the String literal into chunks, each no longer than 64k. The bytecode for initializing the non-constant field cobbles the source string together with a sequence of StringBuilder operations. Although it is this final gigantic string that is interned, the large atoms it is made of also take up space in the constant pool.
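A rough source-level sketch of what such generated code amounts to. This is not the compiler's actual output: the class name ChunkedLiteral, the helpers, and the chunk size of 65535 characters are hypothetical and chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkedLiteral {
    // Assumed chunk size for illustration; the real compiler's split points may differ.
    static final int MAX_CHUNK_CHARS = 65535;

    // Splits a long literal into pieces of at most maxChars characters each.
    static List<String> split(String literal, int maxChars) {
        List<String> chunks = new ArrayList<String>();
        for (int start = 0; start < literal.length(); start += maxChars) {
            int end = Math.min(literal.length(), start + maxChars);
            chunks.add(literal.substring(start, end));
        }
        return chunks;
    }

    // Rebuilds the full string from its chunks, as a chain of StringBuilder appends would.
    static String reassemble(List<String> chunks) {
        StringBuilder sb = new StringBuilder();
        for (String chunk : chunks) {
            sb.append(chunk);
        }
        return sb.toString();
    }
}
```

With a 2^19-character literal and this chunk size, the split yields 9 pieces (8 full chunks plus a remainder of 8 characters). Each piece exists as its own constant alongside the reassembled string, which is consistent with the extra footprint described above.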

Ron
  • That makes a heck of a lot of sense. I found also that 1 MB of the 2.5 MB is recoverable, if all instances are garbage (non-static case as above), and in that case I guess it's the final string which is released to save that, but the atoms are left behind. – BeeOnRope Feb 23 '11 at 20:12
  • Bonus question: How does static final differ? In this case it only used 1.5 MB. Are the chunks discarded in this case - or is the method completely different? – BeeOnRope Feb 23 '11 at 20:13
  • static final (and private non-static final) permit the java compiler to represent the string solely as a constant in the constant pool. I used jmap -histo:live to measure the size of my constant pool for each of the test cases. YMMV IANAL FWIW. – Ron Feb 23 '11 at 22:00

Java characters have a width of 2 bytes per character (regardless of whether it's ASCII or a code point above 255). I think that what you are seeing is the Java VM translating the class file's storage format (modified UTF-8) for the string into its internal expanded form as soon as the class is initialized (which is done prior to instance creation).

Dirk
  • Sure, I accounted for that. My strings were 512k characters, so I would expect them to be 1 MB in-memory (2 bytes per character). – BeeOnRope Feb 21 '11 at 10:42
  • Note also that this doesn't occur at class init in my example above. If I access the class, but don't create an instance, the memory footprint never goes above 500k. Only when I create an instance of my class does it jump another 2 MB. – BeeOnRope Feb 21 '11 at 10:44

While the class file format specifies modified UTF-8 as its storage format for String literals, the internal format of the runtime is UTF-16. A String stores its data in UTF-16 encoding in a char[] (usually; it's implementation-dependent, however). Most characters take up 2 bytes in this encoding (characters outside the BMP take up 4).

I've seen references to a modified rt.jar that contains a java.lang.String implementation with a specialized code-path/storage for ASCII-only Strings, which cut down on the memory requirement significantly.

Edit: it seems this option has found its way into the normal Oracle JRE since Java 6 Update 21 according to this reference:

-XX:+UseCompressedStrings

Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

(Found through this answer).

Joachim Sauer
  • Sure - but see my numbers above: I am well aware of the storage of characters at runtime. I would expect a 512k character string to take 1 MB, but actually 2 MB are used. – BeeOnRope Feb 21 '11 at 10:44

A good memory profiler (I personally use and really like YourKit Java Profiler) should be able to show you where the memory is being used.

jtahlborn
  • I'd like to think so too - I tried MAT, but information on permgen is lacking. In fact, they document that information on interned strings is unfortunately not even available from dumps. – BeeOnRope Feb 22 '11 at 01:12