Could JVM be smarter about sharing String data?

Question

Take a look at this test:

    String s1 = "1234";
    String s2 = "123";
    Field field = String.class.getDeclaredField("value");
    field.setAccessible(true);
    char[] value1 = (char[]) field.get(s1);
    char[] value2 = (char[]) field.get(s2);
    System.out.println(value1 == value2);

It prints false, which means that the JVM holds two different char arrays for s1 and s2. Can anybody explain why s1 and s2 cannot share the same char array? It seems that java.lang.String was designed for content sharing, wasn't it?

Note: I don't know about all JVMs. This is Oracle's Java HotSpot(TM) Client VM 22.1-b02 (JRE 1.7).

UPDATE

On the other hand, if partial sharing is rare (it seems to happen only for Strings created by String.substring), then why should every String carry the int count and int offset fields? That is 8 useless bytes. And it is not only the size but also the creation speed: the bigger the object, the longer its initialization takes. Here's a test:

    long t0 = System.currentTimeMillis();
    for (int i = 0; i < 10000000; i++) {
        new String("xxxxxxxxxxxxx");
    }
    System.out.println(System.currentTimeMillis() - t0);

It takes ~200 ms. If I use this class

    class String2 {
        char[] value;
        String2(String2 s) {
            value = s.value;
        }
    }

it takes ~140 ms.

Why should it be the same array? 123 and 1234 are different values. — Wojciech Górski, Dec 10 '12 at 15:38
Open java.lang.String src, you will see that String has 3 fields char[] value, int offset, int count. Thus, if there is a "1234" in the JVM string constant pool, s2 could point its value to this char array, set offset = 0; count = 3. No need for a new char array. Makes sense? — Evgeniy Dorofeev, Dec 10 '12 at 15:43
Makes sense, that would be a nice optimization in terms of memory consumption. However, when creating a new string, the JVM would have to search all string instances and try to find a substring that matches the one it's trying to allocate. That sounds like a massive overhead and IMHO outweighs the benefits you gain from pointing to the same array in memory. — Wojciech Górski, Dec 10 '12 at 15:49
Well, I intentionally made s1 and s2 start with the same characters "123". If you write a program yourself would it be any overhead if you wanted to test, before creating a new char array for s2, if there is a string starting with "123" in your pool? — Evgeniy Dorofeev, Dec 10 '12 at 15:55
Assuming you don't create any index-like structures (which would also require some memory), you would have to go through every string instance you have already allocated, so the computational complexity would be O(n^2) (number of allocated strings * number of letters in string). Not very nice. — Wojciech Górski, Dec 10 '12 at 16:02
It also requires some memory to hold "1234" and "123" as separate arrays in the pool, doesn't it? — Evgeniy Dorofeev, Dec 10 '12 at 16:07
Of course, I don't question it. What I'm saying is that the time needed to find a string in memory that you could use is probably not worth the gains in terms of memory usage. — Wojciech Górski, Dec 10 '12 at 16:09
Right, it's possible, cannot count o(n^2). Just afraid someone has made a mistake... — Evgeniy Dorofeev, Dec 10 '12 at 16:17

score 5 · Accepted Answer · edited May 23 '17 at 12:03

Can anybody explain the reason why s1 and s2 cannot share the same char array?

They can, they just don't, probably because the JVM start-up time would be impacted by looking through the string intern pool for partial matches.

It's worth noting that with non-interned strings, they can share a char array, in certain cases:

    String s1 = "1234";
    String s2 = s1.substring(0, 3);

...at least through OpenJDK 6. Apparently, in OpenJDK 7 they don't share anymore (thank you Marko Topolnik for teaching me that here).
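To make the sharing mechanism concrete, here is a minimal sketch of the value/offset/count layout discussed in the question. `SharedString` and its methods are hypothetical names made up for illustration; this is not the JDK source, only the same idea in miniature.

```java
// Hypothetical illustration only; SharedString is not a JDK class.
final class SharedString {
    private final char[] value; // backing array, possibly shared with other instances
    private final int offset;   // index of this string's first character in value
    private final int count;    // number of characters that belong to this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // No characters are copied: the result is a new window over the same array.
    SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    // True when both instances point at the very same backing array.
    boolean sharesArrayWith(SharedString other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

With this layout, `new SharedString("1234").substring(0, 3)` represents "123" while reusing the array of "1234". The flip side is that a short substring keeps the whole backing array reachable for as long as the substring lives.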

And interestingly, Sun's JVM 1.6 separates them if you intern:

    String s1 = "1234";
    String s2 = s1.substring(0, 3);
    Field field = String.class.getDeclaredField("value");
    field.setAccessible(true);
    char[] value1 = (char[]) field.get(s1);
    char[] value2 = (char[]) field.get(s2);
    System.out.println(value1 == value2);
    s2 = s2.intern();
    value2 = (char[]) field.get(s2);
    System.out.println(value1 == value2);

I get:

    true
    false

I guess it doesn't like having strings in the intern pool that are subsets of other strings.

answered Dec 10 '12 at 15:39

T.J. Crowder

+1. Also worth mentioning, for string literals only - it can be done by a compiler, the jvm doesn't even need to know about this optimization. – amit Dec 10 '12 at 15:49
I intentionally took the simplest case, where both s1 and s2 start with the same characters "123". I cannot believe that there can be any start-up time impact in this situation. – Evgeniy Dorofeev Dec 10 '12 at 15:49
@amit: Interning is done by the JVM, not by the compiler. But yes, the compiler could emit code turning Evgeniy's `String s2 = "123";` into `String s2 = s1.substring(0, 3);` instead, but I can't imagine it would be a good idea. – T.J. Crowder Dec 10 '12 at 15:52
@EvgeniyDorofeev, The class file format defines string constants using the [CONSTANT_Utf8_info](http://docs.oracle.com/javase/specs/jvms/se5.0/html/ClassFile.doc.html) struct. The length immediately precedes the byte array in the class file, so two strings, one of which is a substring of another, cannot be combined by the class file generator. Doing prefix checks when converting from UTF8 to UTF16 would require building and walking a trie so there would be some start-up time impact. – Mike Samuel Dec 10 '12 at 15:52
@T.J.Crowder, an especially bad idea when your JVM allows people to mutate backing arrays via `Field.setAccessible`. – Mike Samuel Dec 10 '12 at 15:55
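Mike Samuel's point about the length prefix can be seen with `DataOutputStream.writeUTF`, which writes a 2-byte length followed by modified UTF-8 bytes. That is similar to the length and bytes portion of a `CONSTANT_Utf8_info` entry, minus the leading tag byte. The class name `Utf8ConstantSketch` is made up for this sketch, and using `writeUTF` as a stand-in for what a class file generator emits is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper; it only mimics the length-prefixed layout of a string constant.
final class Utf8ConstantSketch {
    // Encodes s as a 2-byte big-endian length followed by modified UTF-8 bytes.
    static byte[] encode(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode("123")));  // [0, 3, 49, 50, 51]
        System.out.println(Arrays.toString(encode("1234"))); // [0, 4, 49, 50, 51, 52]
    }
}
```

Because the length comes first, the encoded "123" is not a byte prefix of the encoded "1234", so the two entries cannot simply overlap in the file.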

score 4 · Answer 2 · answered Dec 10 '12 at 15:38

Can anybody explain the reason why s1 and s2 cannot share the same char array?

Because "1234" is not the same sequence of characters as "123".

Matt Ball

But AFAIK `"1234".substring(0,2)` might actually use the same `char[]`, won't it? (**EDIT:** Tested: it does use the same `char[]`, even though it is not the same sequence. It uses the `count` and `offset` fields for that.) – amit Dec 10 '12 at 15:39
`#substring()` calls _can_ reference the same `char[]` as `this` but there's no guarantee that it will. But the OP's code does not include a `#substring()` call. – Matt Ball Dec 10 '12 at 15:41
But in some cases it actually does. Thus, I do not understand why the fact that "1234" is not the same char sequence as "123" answers this question. Can you elaborate? – amit Dec 10 '12 at 15:43

score 0 · Answer 3 · answered Dec 16 '12 at 07:43

My take is the reason that JVMs don't go to that length in interning the strings is that it is simply not worth it:

A naive implementation of interning that minimized the usage of space as you propose would have O(N^2) performance where N is the number of characters of unique string data that is interned in the lifetime of the JVM. (OK, it is a bit more complicated than that ... but it is expensive.)

An implementation that attempted to avoid the O(N^2) problem would typically end up using more space to avoid the problem than is saved by sharing character arrays.

The String implementation (including interning) is a pragmatic implementation that balances the competing concerns to give the best performance when averaged over a range of real-world applications.
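As a rough illustration of the naive approach, here is a sketch of a pool that tries to reuse an existing array whenever a new string is a prefix of one already stored. `NaivePrefixPool` and `backingArrayFor` are hypothetical names, and the sketch handles only the prefix case from the question; it is not how any JVM interns strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a space-minimizing pool; not how any JVM interns strings.
final class NaivePrefixPool {
    private final List<char[]> arrays = new ArrayList<>();

    // Returns a pooled array that starts with s, or stores and returns a fresh copy.
    char[] backingArrayFor(String s) {
        for (char[] candidate : arrays) { // linear scan over every pooled array
            if (startsWith(candidate, s)) {
                return candidate;         // a caller would use offset 0 and count s.length()
            }
        }
        char[] fresh = s.toCharArray();
        arrays.add(fresh);
        return fresh;
    }

    // Character-by-character prefix comparison.
    private static boolean startsWith(char[] array, String s) {
        if (array.length < s.length()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (array[i] != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }
}
```

Every lookup may compare against every pooled array, so the total work grows quickly with the amount of interned data. The result also depends on insertion order: "123" added after "1234" shares its array, but "12345" added after "1234" does not. Avoiding the scan needs an index such as a trie, which itself costs memory.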
