0

I learned that the String representation was improved in Java 9, and I have found some information on it. I would like to know more: how exactly is memory usage reduced and garbage collection improved compared to previous Java versions? There are a couple of existing questions on this topic, but I am not fully convinced by the answers there. Thanks in advance.

Ashish Lohia
  • 2
    This also https://www.javagists.com/compact-strings-java-9 – soorapadman Aug 03 '17 at 03:59
  • 1
    https://www.sitepoint.com/inside-java-9-part-ii/ – soorapadman Aug 03 '17 at 04:38
  • @soorapadman I already mentioned that there are questions related to this topic. If you read closely, I am more concerned about memory management and garbage collection. – Ashish Lohia Aug 03 '17 at 07:15
  • 2
    Honestly, I do not understand why this question gets upvotes. It just says "I read some stuff that wasn't helpful", without any links to that material or facts explaining why that content isn't sufficient. – GhostCat Aug 03 '17 at 07:18

2 Answers

11

In Java 8

In Java 8 String has a field char[] value - a char takes up two bytes because it represents a full UTF-16 code unit. Oracle's analysis of many, many heap dumps came to two conclusions:

  • these arrays occupy somewhere between 20 % and 30 % of an average application’s live data (including headers and pointers)
  • the overwhelming majority of strings only require ISO-8859-1 (also called Latin-1), which needs just a single byte per character.

Refactoring String to only use one byte for Latin-1 could hence save about 10 % to 15 % memory and improve run time performance by reducing garbage allocation.

In Java 9

In Java 9 String is backed by a byte[] value, so that Latin-1 characters can use only a single byte. But what happens if a string contains both Latin-1 characters and characters that need two bytes in UTF-16?

This may sound like a case for variable-sized records like UTF-8, where the distinction between one and two bytes is made per character. But then there would be no way to predict for a single character which array slots it will occupy, thus requiring random access (e.g. charAt(int)) to perform a linear scan. Degrading random access performance from constant to linear time was an unacceptable regression.
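To make the cost concrete, here is a minimal sketch of what an index lookup would involve if the bytes were UTF-8. The class and method names are made up for illustration, and the code assumes well-formed input; it is not how any JDK class works.

```java
// Illustrative sketch, not JDK code. Assumes well-formed UTF-8 input.
final class Utf8Index {

    // Number of bytes in the UTF-8 sequence that starts with this lead byte.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1; // 0xxxxxxx: ASCII
        if (b < 0xE0) return 2; // 110xxxxx
        if (b < 0xF0) return 3; // 1110xxxx
        return 4;               // 11110xxx
    }

    // Byte offset of the code point at the given index. Every earlier
    // sequence has to be visited, so the cost grows with the index.
    static int byteOffsetOfCodePoint(byte[] utf8, int index) {
        int offset = 0;
        for (int seen = 0; seen < index && offset < utf8.length; seen++) {
            offset += sequenceLength(utf8[offset]);
        }
        if (index < 0 || offset >= utf8.length) {
            throw new IndexOutOfBoundsException("code point index " + index);
        }
        return offset;
    }

    public static void main(String[] args) {
        // 'a' takes 1 byte, the euro sign takes 3, so 'b' starts at offset 4.
        byte[] bytes = "a\u20acb".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(byteOffsetOfCodePoint(bytes, 2)); // prints 4
    }
}
```

With a fixed width per character the same lookup is a single multiplication, which is why a per-string choice between one and two bytes keeps random access constant-time.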

Instead, the choice is made per string: if every character can be encoded with a single byte, that representation is chosen; if at least one of them requires two, two bytes are used for all of them. A new field, coder, denotes how the bytes encode characters, and many methods in String evaluate it to pick the correct code path.
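As a rough model of such a code path, the sketch below dispatches charAt on the coder. CompactText is a hypothetical class written for this answer, not the JDK source, and the big-endian byte order for the two-byte case is an assumption of the sketch.

```java
// Simplified model, not the JDK source: a byte[] plus a coder flag,
// with charAt picking the code path in constant time.
final class CompactText {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    CompactText(byte[] value, byte coder) {
        this.value = value;
        this.coder = coder;
    }

    int length() {
        // With UTF16 every char occupies two bytes, so the array length is halved.
        return value.length >> coder;
    }

    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            // One byte per char; mask to avoid sign extension.
            return (char) (value[index] & 0xFF);
        }
        // Two bytes per char; big-endian is an assumption of this sketch.
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }
}
```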

Here’s how that looks in a simplified version of the String constructor:

// `char[] value` is the argument
if (COMPACT_STRINGS) {
    byte[] val = StringUTF16.compress(value);
    if (val != null) {
        this.value = val;
        this.coder = LATIN1;
        return;
    }
}
this.coder = UTF16;
this.value = StringUTF16.toBytes(value);

There are a couple of things to note here:

  • The boolean flag COMPACT_STRINGS, which reflects the command line flag -XX:-CompactStrings, with which the compression can be disabled.
  • The utility class StringUTF16 is first used to try and compress the value array to single bytes and, should that fail and return null, convert it to double bytes instead.
  • The coder field is assigned the respective constant that marks which case applies.
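The two helper calls in the snippet can be approximated as follows. This is only a sketch of the idea: the real StringUTF16 methods are JDK internals and differ in detail, CompressSketch is a made-up name, and the big-endian layout in toBytes is an assumption.

```java
// Sketch of the two helpers the constructor snippet relies on.
// Not the JDK implementation; names and byte order are assumptions.
final class CompressSketch {

    // Returns one byte per char, or null as soon as a char does not fit in Latin-1.
    static byte[] compress(char[] value) {
        byte[] out = new byte[value.length];
        for (int i = 0; i < value.length; i++) {
            char c = value[i];
            if (c > 0xFF) {
                return null; // caller falls back to the two-byte layout
            }
            out[i] = (byte) c;
        }
        return out;
    }

    // Stores every char as two bytes (high byte first in this sketch).
    static byte[] toBytes(char[] value) {
        byte[] out = new byte[value.length * 2];
        for (int i = 0; i < value.length; i++) {
            char c = value[i];
            out[2 * i] = (byte) (c >> 8);
            out[2 * i + 1] = (byte) c;
        }
        return out;
    }
}
```

Note that a single non-Latin-1 char is enough to make compress give up, so one such character doubles the storage of the whole string.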

If you want to learn more about compact strings (and indified string concatenation), have a look at JEP 254 or Aleksey Shipilev’s talk (it also includes some benchmarks).

valiano
Nicolai Parlog
  • 1
    You are mixing up encoding and charset. All three encodings, UTF-8, UTF-16, and UTF-32 use the same charset (Unicode) and therefore can represent the same characters. So the question “what happens if a string uses both UTF-8 and UTF-16 (or even UTF-32) characters?” makes no sense. The correct statement would be that Java 9's strings encode their content as ISO Latin-1 using a single byte per character when possible (as the follow-up code snippet suggests) and use UTF-16 otherwise. – Holger Feb 06 '18 at 14:14
2

Java 9 brings String optimizations through JEP 254 (Compact Strings).

"Instead of having char[] array, String is now represented as byte[] array. Depending on which characters it contains, it will either use UTF-16 or Latin-1, that is – either one or two bytes per character. There is a new field inside the String class – coder, which indicates which variant is used. Unlike Compressed Strings, this feature is enabled by default. If necessary (in a case where there are mainly UTF-16 Strings used), it can still be disabled by -XX:-CompactStrings.

The change does not affect any public interfaces of String or any other related classes. Many of the classes were reworked to support the new String representation, such as StringBuffer or StringBuilder."

http://openjdk.java.net/jeps/254

Description

We propose to change the internal representation of the String class from a UTF-16 char array to a byte array plus an encoding-flag field. The new String class will store characters encoded either as ISO-8859-1/Latin-1 (one byte per character), or as UTF-16 (two bytes per character), based upon the contents of the string. The encoding flag will indicate which encoding is used.

String-related classes such as AbstractStringBuilder, StringBuilder, and StringBuffer will be updated to use the same representation, as will the HotSpot VM's intrinsic string operations.

This is purely an implementation change, with no changes to existing public interfaces. There are no plans to add any new public APIs or other interfaces.

The prototyping work done to date confirms the expected reduction in memory footprint, substantial reductions of GC activity, and minor performance regressions in some corner cases.
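Since the JEP describes this as an internal change with no new public API, application code cannot simply ask a String which encoding it uses. One way to estimate whether a given string would qualify for the one-byte form is to test whether it is encodable in ISO-8859-1. The helper below is a hypothetical sketch built on the standard CharsetEncoder.canEncode method; it does not inspect the actual internal representation.

```java
// Hypothetical helper: estimates whether a string would fit the
// one-byte-per-character form by checking ISO-8859-1 encodability.
// It does not read the String's internal coder field.
final class Latin1Check {

    static boolean fitsLatin1(String s) {
        // A fresh encoder per call keeps the method free of shared state.
        return java.nio.charset.StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(fitsLatin1("caf\u00e9")); // e with acute accent is in Latin-1
        System.out.println(fitsLatin1("\u20ac"));    // the euro sign is not
    }
}
```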

For further detail, see

http://cr.openjdk.java.net/~shade/density/state-of-string-density-v1.txt

http://cr.openjdk.java.net/~huntch/string-density/reports/String-Density-SPARC-jbb2005-Report.pdf

https://www.youtube.com/watch?v=wIyeOaitmWM

https://www.infoq.com/news/2016/02/compact-strings-Java-JDK9

vaquar khan