String representation in Java and compacting Strings

Question

Recently I stumbled upon this JEP 254: Compact Strings which basically targets:

Summary: Adopt a more space-efficient internal representation for strings.

From my current experience, Strings and char[] occupy a huge percentage of total heap consumption. As the JEP already states:

The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.

Considering this, I have the following questions:

How do other developers currently handle this issue when Strings that hold only 1-byte characters make up a large part of the heap profile?
Why is this being implemented only now, and why was a solution not attempted earlier?
Are there already open source libraries that target this issue?

I have gone through basic questions like this and this regarding facts about String, which cover how the String pool and String interning work and why a single char in a String currently occupies 2 bytes.

I mean only such characters which will need only a byte to store. — Aaditya Gavandalkar, Jan 10 '16 at 07:39
I see. Always cool to invent your own terms. You may be referring to ASCII (essentially 0-127 range of Unicode) or Latin-1 as your post mentions in other part - consider if coming up with your own term is necessary. — Alexei Levenkov, Jan 10 '16 at 07:43

Tagir Valeev · Accepted Answer · 2016-01-10T08:09:40.703

This was actually attempted earlier: Java 6 had an option, -XX:+UseCompressedStrings, which enabled a feature similar to JEP 254. However, this feature was dropped in Java 7 due to the additional complexity (which introduced bugs like this or this) and performance losses. One of the problems was that at that time strings could share the underlying buffer (substring() returned a new string which shared the same buffer with the original string). This added a lot of complexity to string compaction (what if the original string uses non-Latin-1 symbols while the substring uses Latin-1 only?).

Now string buffers are never shared between non-equal strings, so the implementation became easier. Nevertheless, it is still quite hard and involves many caveats. One of the JEP 254 goals is to try very hard not to lose even a tiny bit of performance. Don't forget that the String class is very basic: some of its methods (like equals, indexOf) are intrinsified by the JIT compiler; some scenarios are handled specially (like the optimization of String concatenation). All of these features rely heavily on the internal String representation and have to be rewritten as well for compact strings.

If you want to compact the strings in your current code, you may implement a custom CompactString class which implements the CharSequence interface and uses byte[] internally. The problem is that tons of existing code works with String, not with CharSequence, and the CharSequence interface is actually very limited. So it would be quite hard to use such a class widely.
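A minimal sketch of what such a class might look like, assuming only Latin-1 input is accepted. The class name CompactString comes from the paragraph above, but the factory method of() and the rest of the layout are hypothetical and not taken from any library or from the JEP 254 implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of the JDK or any library.
final class CompactString implements CharSequence {
    private final byte[] bytes; // one byte per Latin-1 character

    private CompactString(byte[] bytes) {
        this.bytes = bytes;
    }

    // Copies the characters into a byte[]; rejects anything outside Latin-1.
    static CompactString of(CharSequence s) {
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                throw new IllegalArgumentException("Not a Latin-1 char at index " + i);
            }
            b[i] = (byte) c;
        }
        return new CompactString(b);
    }

    @Override
    public int length() {
        return bytes.length;
    }

    @Override
    public char charAt(int index) {
        return (char) (bytes[index] & 0xFF); // widen without sign extension
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > bytes.length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        // Copy rather than share the buffer, so each instance owns its bytes.
        return new CompactString(Arrays.copyOfRange(bytes, start, end));
    }

    @Override
    public String toString() {
        // Converting back allocates a regular String each time.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Any API that takes a String forces a toString() call, which allocates a regular String again. That is the practical limitation described above.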

score 1 · Answer 2 · answered Jan 10 '16 at 07:46

UTF-8 is a character encoding that works for all Unicode characters; Java strings are stored in UTF-16 encoding instead, and always are. Implementing a variable string storage would likely be a huge performance hit, as the JVM would first have to decide whether it is looking at a Latin-1 string value or a UTF-16 one.

Also, UTF-16 encoding provides more consistent handling of string properties and operations. Latin-1 strings would have to be converted to UTF-16 first in order to append non-Latin-1 characters. Comparing a Latin-1 string to a UTF-16 string is also quite messy. Essentially you would have to convert the Latin-1 string to a UTF-16 string (or at least iterate over it via the CharSequence interface) for almost all operations.
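To make the conversion step concrete, here is a rough sketch of widening a Latin-1 byte[] to UTF-16 chars before appending a character that does not fit in one byte. The class and method names (Latin1Inflation, inflate, appendChar) are made up for illustration and say nothing about how a JVM would actually do it:

```java
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class Latin1Inflation {

    // Widens each Latin-1 byte to a 16-bit char (values 0x00..0xFF map 1:1).
    static char[] inflate(byte[] latin1) {
        char[] utf16 = new char[latin1.length];
        for (int i = 0; i < latin1.length; i++) {
            utf16[i] = (char) (latin1[i] & 0xFF); // mask to avoid sign extension
        }
        return utf16;
    }

    // Appends a char to Latin-1 content by inflating it to UTF-16 first.
    static char[] appendChar(byte[] latin1, char c) {
        char[] result = Arrays.copyOf(inflate(latin1), latin1.length + 1);
        result[latin1.length] = c;
        return result;
    }
}
```

For example, appending '\u20ac' (the euro sign) to the Latin-1 bytes of "café" yields a five-element char[], with every original character now taking two bytes.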

Kayaman · Answer 3 · 2016-01-10T07:50:38.967

-1

The in-memory representation/encoding of char (and therefore String) in Java is UTF-16, which requires at least 2 bytes per character, even if you're using characters and encodings in your program that would require a single byte in other charsets (Latin-1, part of UTF-8, etc.).
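A small sketch that makes the size difference visible by encoding the same text with different charsets. It measures encoded byte counts only, not the actual heap layout of a String object, and the class name EncodedSizes is just for illustration:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class EncodedSizes {

    // Bytes needed in UTF-16 (big-endian variant, so no byte-order mark is added).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed in Latin-1; only meaningful if every char is <= 0xFF.
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }
}
```

For "hello", utf16Bytes returns 10 while latin1Bytes returns 5. A character outside the Basic Multilingual Plane, such as "\uD83D\uDE00", takes 4 bytes in UTF-16, which is where the "at least" comes from.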

This issue might not have been the most relevant earlier, but now, with multi-gigabyte heaps and who knows what other reasons, they're taking a new look at slimming down the JVM heap footprint.

Since this is a JVM internal issue, there are no libraries that could affect it. You would need a custom JVM which could be non-conforming to the rules (assuming the UTF-16 encoding is specified somewhere).

edited Jan 10 '16 at 07:50

answered Jan 10 '16 at 07:43

Kayaman


I am already aware of that fact and mentioned the same in the question too. The last two are not questions but things I am already aware of. – Aaditya Gavandalkar Jan 10 '16 at 07:45
Then what is your `How is this handled currently` question supposed to mean? – Kayaman Jan 10 '16 at 07:47
I mean, how do other developers get past this issue? Edited just to be clear in case anyone else is confused. – Aaditya Gavandalkar Jan 10 '16 at 07:47
There's nothing to get past. If you don't have enough memory for your Strings, then buy more memory. – Kayaman Jan 10 '16 at 07:51
Can't a byte array be used to tackle this in some way? But then we always end up converting it to strings, so I'm not sure. – Aaditya Gavandalkar Jan 10 '16 at 07:51
It's so nice of you to consider everyone else as dumb people and yourself as the smarter one. – Aaditya Gavandalkar Jan 10 '16 at 08:13
I don't consider everyone else dumb. You're just not being very clear in your questions or comments. The first comment to your question was "what do you mean 'only UTF-8 chars'", and your comments to my answer haven't been very understandable either. – Kayaman Jan 10 '16 at 08:18
