
A question about performance considerations for String.substring. Prior to Java 1.7.0_06, the String.substring() method returned a new String object that shared the same underlying char array as its parent, but with a different offset and length. To avoid keeping a very large string in memory when only a small substring needed to be kept, programmers used to write code like this:

s = new String(queryReturningHugeHugeString().substring(0,3));

From 1.7.0_06 onwards, it has not been necessary to create a new String because in Oracle's implementation of String, substrings no longer share their underlying char array.

My question is: can we rely on Oracle (and other vendors) not going back to char[] sharing in some future release, and simply write s = s.substring(...), or should we explicitly create a new String just in case some future release of the JRE starts using a sharing implementation again?
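To make the two alternatives concrete, here is a minimal sketch of both forms side by side. The class and method names (SubstringCopy, prefix, detachedPrefix) are hypothetical and exist only for illustration; the sketch shows the two idioms and does not measure or prove anything about memory retention.

```java
public final class SubstringCopy {
    private SubstringCopy() {}

    // Plain substring. In Oracle's implementation from 1.7.0_06 onwards the
    // result does not share the parent's backing array.
    public static String prefix(String source, int length) {
        return source.substring(0, Math.min(length, source.length()));
    }

    // Defensive variant: the historic idiom that wraps the substring in
    // new String(...) so the result never depends on the parent's array,
    // whatever the substring implementation does.
    public static String detachedPrefix(String source, int length) {
        return new String(prefix(source, length));
    }

    public static void main(String[] args) {
        // Stand-in for queryReturningHugeHugeString(): assumption for the demo.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) {
            sb.append('x');
        }
        String huge = sb.toString();

        String plain = prefix(huge, 3);
        String detached = detachedPrefix(huge, 3);

        // Both forms yield equal content; they differ only in whether an
        // extra String object is allocated.
        System.out.println(plain.equals(detached));
    }
}
```

Both methods return equal strings, so the choice between them only affects allocation and, on a sharing implementation, how long the parent's array stays reachable.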

Tagir Valeev
Klitos Kyriacou
  • Not an answer exactly, but a very good answer over here http://stackoverflow.com/a/20275133/2796832 may help. Perhaps, if you really need to, you could use the `getValueLength` from that answer and then use that to flag your code. – Jonah Graham Nov 24 '15 at 12:31
  • @JonahGraham, bad advice. This may break as early as Java 9: the `char[]` array is expected to be replaced with `byte[]` there. Accessing private JDK fields via reflection is not a good idea in general. – Tagir Valeev Nov 24 '15 at 12:39
  • @TagirValeev TBH I had been contemplating putting that comment as an answer, but it made me feel uncomfortable. I am sure the OP did detailed analysis to ensure that the extra complication in their code was not a premature optimization. However, I am not sure every future reader would take the same care and attention. Anyway, your answer is a good one +1 – Jonah Graham Nov 24 '15 at 13:45

1 Answer


The actual representation of String is an internal implementation detail, so you can never be sure. However, according to public talks by Oracle engineers (most notably @shipilev), it is very unlikely to be changed back. The change was made not only to prevent a possible memory leak, but also to simplify the String internals. With simpler strings it is easier to implement optimization techniques such as String deduplication or Compact Strings.

Tagir Valeev
  • Actually, the existence of String de-duplication also shows that we don’t ever need to worry about substring representation anyway. If a JVM’s garbage collector is capable of patching strings to let equal instances share the same array, it is not too far-fetched to assume that it would be capable of patching substrings as well if a JVM vendor ever decides to go back to the shared substring (offset+length) representation. – Holger Nov 24 '15 at 18:53