
I know that in Oracle Java 1.7 update 6 and newer, String.substring copies the internal character array of the String, while in older versions the array is shared. But I found no official API that would tell me the current behavior.

Use Case

In a parser, I would like to detect whether String.substring copies or shares the underlying character array. If the character array is shared, my parser needs to explicitly "un-share" it using new String(s) to avoid memory problems. However, if String.substring copies the data anyway, this is not necessary, and the explicit copy in the parser could be avoided. Example:

// possibly the query is very very large
String query = "select * from test ...";
// the identifier is used outside of the parser
String identifier = query.substring(14, 18);

// avoid if possible for speed,
// but needed if identifier internally 
// references the large query char array
identifier = new String(identifier);

What I Need

Basically, I would like to have a static method boolean isSubstringCopyingForSure() that would detect if new String(..) is not needed. I'm OK if detection doesn't work if there is a SecurityManager. Basically, the detection should be conservative (to avoid memory problems, I'd rather use new String(..) even if not necessary).

Options

I have a few options, but I'm not sure if they are reliable, especially for non-Oracle JVMs:

Checking for the String.offset field

import java.lang.reflect.Field;

/**
 * @return true if substring is copying, false if not or if it is not clear
 */
static boolean isSubstringCopyingForSure() {
    if (System.getSecurityManager() != null) {
        // we can not reliably check it
        return false;
    }
    try {
        for (Field f : String.class.getDeclaredFields()) {
            if ("offset".equals(f.getName())) {
                return false;
            }
        }
        return true;
    } catch (Exception e) {
        // weird, we do have a security manager?
    }
    return false;
}

Checking the JVM version

static boolean isSubstringCopyingForSure() {
    // but what about non-Oracle JREs?
    return System.getProperty("java.vendor").startsWith("Oracle") &&
           System.getProperty("java.version").compareTo("1.7.0_45") >= 0;
}
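One caveat with the version check: `compareTo` on version strings is lexicographic, so an update number with more digits (for example `1.7.0_101`) sorts before `1.7.0_45`. That errs on the conservative side, but a numeric comparison is easy to sketch. The class name `JavaVersionCheck` and the method `isVersionAtLeast7u6` below are hypothetical, and the two regular expressions only assume the common `1.<major>.0_<update>` and `<major>[.minor...]` layouts of `java.version`. The sketch compares version numbers only; it does not answer the vendor question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JavaVersionCheck {
    // Legacy layout, e.g. "1.7.0_45" or "1.7.0_45-b18" (assumed format)
    private static final Pattern LEGACY =
            Pattern.compile("^1\\.(\\d+)\\.\\d+(?:_(\\d+))?.*$");
    // Newer layout, e.g. "9", "11.0.2", "17-ea" (assumed format)
    private static final Pattern MODERN =
            Pattern.compile("^(\\d+)(?:[.\\-+].*)?$");

    /** @return true only if the version string clearly denotes 7u6 or newer */
    static boolean isVersionAtLeast7u6(String version) {
        if (version == null) {
            return false;
        }
        try {
            Matcher m = LEGACY.matcher(version);
            if (m.matches()) {
                int major = Integer.parseInt(m.group(1));
                int update = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
                return major > 7 || (major == 7 && update >= 6);
            }
            m = MODERN.matcher(version);
            if (m.matches()) {
                return Integer.parseInt(m.group(1)) >= 9;
            }
        } catch (NumberFormatException e) {
            // unparseable number: fall through to the conservative answer
        }
        return false;
    }
}
```

A caller would pass `System.getProperty("java.version")` and still decide separately which vendors to trust.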

Checking the behavior

There are two options, and both are rather complicated. One is to create a string using a custom charset, create a new string b from it using substring, then modify the original string and check whether b has changed as well. The second option is to create a huge string, then a few substrings, and check the memory usage.

Thomas Mueller

4 Answers

3

This is not a detail you need to care about. No, really! Just call identifier = new String(identifier) in both cases (JDK6 and JDK7). Under JDK6 it will create a copy (as desired). Under JDK7 (7u6 and later), because the substring is already a unique string, the constructor is essentially a no-op (no copy is performed; read the code). Sure, there is a slight overhead of object creation, but because of object reuse in the young generation, I challenge you to quantify a performance difference.
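A minimal sketch of that unconditional approach, using the query from the question. The class name `StringDetach` and the helper `detach` are hypothetical names introduced for illustration only.

```java
class StringDetach {
    /**
     * Returns a String that does not keep a larger source array alive.
     * On JVMs where substring shares the array, the constructor copies;
     * where substring already copies, only a small wrapper object is created.
     */
    static String detach(String s) {
        return new String(s);
    }

    public static void main(String[] args) {
        // possibly the query is very large
        String query = "select * from test where id = 1";
        // the identifier is used outside of the parser
        String identifier = detach(query.substring(14, 18));
        System.out.println(identifier);
    }
}
```

Keeping the call in one helper means the parser has a single place to change if measurements later show that the extra allocation matters.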

brettw
  • Good idea! I will measure the performance overhead of `new String(s)` and post it. – Thomas Mueller Nov 28 '13 at 08:24
  • I'm interested in your result. In JDK7, `identifier = new String(identifier)` should be no more than a few microseconds (maybe less!), so even with tens of thousands of Strings I would be surprised to see a performance difference greater than a few milliseconds. – brettw Nov 28 '13 at 08:29
  • I think you are right, it doesn't make a big difference. I still like to keep the question open, because maybe somebody comes up with a creative solution. – Thomas Mueller Nov 28 '13 at 09:51
  • Please mark this answer correct if it addresses your question and you feel you have collected enough other answers. – brettw Nov 29 '13 at 00:18
3

Right, indeed this change was made in 7u6. There is no API change for this, as this change is strictly an implementation change, not an API change, nor is there an API to detect which behavior the running JDK has. However, it is certainly possible for applications to notice a difference in performance or memory utilization because of the change. In fact, it's not difficult to write a program that works in 7u4 but fails in 7u6 and vice-versa. We expect that the tradeoff is favorable for the majority of applications, but undoubtedly there are applications that will suffer from this change.

It's interesting that you're concerned about the case where string values are shared (prior to 7u6). Most people I've heard from have the opposite concern, where they like the sharing and the 7u6 change to unshared values is causing them problems (or, they're afraid it will cause problems).

In any case the thing to do is measure, not guess!

First, compare the performance of your application between similar JDKs with and without the change, e.g. 7u4 and 7u6. Probably you should be looking at GC logs or other memory monitoring tools. If the difference is acceptable, you're done!

Assuming that the shared string values prior to 7u6 cause a problem, the next step is to try the simple workaround of new String(s.substring(...)) to force the string value to be unshared. Then measure that. Again, if the performance is acceptable on both JDKs, you're done!

If it turns out that in the unshared case, the extra call to new String() is unacceptable, then probably the best way to detect this case and make the "unsharing" call conditional is to reflect on a String's value field, which is a char[], and get its length:

// requires: import java.lang.reflect.Field;
int getValueLength(String s) throws Exception {
    Field field = String.class.getDeclaredField("value");
    field.setAccessible(true);
    return ((char[])field.get(s)).length;
}

Consider a string resulting from a call to substring() that returns a string shorter than the original. In the shared case, the substring's length() will differ from the length of the value array retrieved as shown above. In the unshared case, they'll be the same. For example:

String s = "abcdefghij".substring(2, 5);
int logicalLength = s.length();
int valueLength = getValueLength(s);

System.out.printf("%d %d ", logicalLength, valueLength);
if (logicalLength != valueLength) {
    System.out.println("shared");
} else {
    System.out.println("unshared");
}

On JDKs older than 7u6, the value's length will be 10, whereas on 7u6 or later, the value's length will be 3. In both cases, of course, the logical length will be 3.
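The value-length comparison can be folded into the conservative `isSubstringCopyingForSure()` shape asked for in the question. This is only a sketch under assumptions: the class name `SubstringSharingProbe` is hypothetical, the private `value` field is an implementation detail that need not exist or be accessible on every JVM, and any failure (missing field, security restriction, refused `setAccessible`) is treated as "not clear", so the method returns false.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;

class SubstringSharingProbe {
    /**
     * @return true only if the probe shows that substring copied the data;
     *         false if the array is shared or if the result is not clear
     */
    static boolean isSubstringCopyingForSure() {
        try {
            String probe = "abcdefghij".substring(2, 5);
            Field field = String.class.getDeclaredField("value");
            // may throw on locked-down or newer JVMs; handled below
            field.setAccessible(true);
            Object value = field.get(probe);
            if (value == null || !value.getClass().isArray()) {
                return false;
            }
            // Array.getLength avoids assuming a particular element type
            return Array.getLength(value) == probe.length();
        } catch (Exception e) {
            // field missing or not accessible: stay conservative
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSubstringCopyingForSure() ? "unshared" : "shared or unknown");
    }
}
```

Because false only means "keep calling new String(..)", a failed probe costs some speed but never memory.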

Stuart Marks
  • Very good answer! Except that I don't like to call `setAccessible(true)` in a library :-) It turned out that `s = new String(s)` is reasonably fast for Java 7u6 / Java 8, so I will go with that solution I think. – Thomas Mueller Nov 29 '13 at 11:37
  • Thanks! Good to know that the simple `s = new String(s)` workaround is acceptable. I'd be interested to hear how much of a problem the shared string value approach prior to 7u6 is causing. – Stuart Marks Dec 02 '13 at 07:27
  • This is a *terrible* answer. The behavior of a specific implementation is irrelevant to the question. The API contract does not specify whether the underlying array is copied or not. It might be in OpenJDK, and not in IBM's JDK, and some other JDK might copy it only if there's a significant differential in size (for some unstated value of "significant"). – Lawrence Dol Apr 15 '20 at 18:33
1

In older Java versions, String.substring(..) will use the same char array as the original, with a different offset and count.

In the latest Java versions (according to the comment by Thomas Mueller: since 1.7 Update 6), this has changed, and substrings are now created with a new char array.

If you parse lots of sources, the best way to deal with it is to avoid checking the Strings' internals, but anticipate this effect and always create new Strings where you need them (as in the first code block in your question).

String identifier = query.substring(14, 18);
// older Java versions: backed by same char array, different offset and count
// newer Java versions: copy of the desired run of the original char array

identifier = new String(identifier);
// older Java versions: when the backing char array is larger than count, a copy of the desired run will be made
// newer Java versions: trivial operation, create a new String instance which is backed by the same char array, no copy needed.

That way, you end up with the same result with both variants, without having to distinguish them and without unnecessary array copy overhead.

Peter Walser
0

Are you sure that making a string copy is really expensive? I believe that the JVM optimizer has intrinsics for strings and avoids unnecessary copies. Also, large texts are parsed with one-pass algorithms such as LALR automata, generated by compiler compilers. So the parser input is usually a java.io.Reader or another streaming interface, not a solid String. Parsing is usually expensive itself, though still not as expensive as type checking. I don't think that copying strings is a real bottleneck. You had better experiment with a profiler and microbenchmarks before relying on your assumptions.
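As a starting point for such a measurement, here is a deliberately crude timing sketch. The class name `CopyOverheadTiming` and its method are hypothetical, the query literal is made up, and a hand-rolled loop like this is easily distorted by JIT warm-up and dead-code elimination, so a proper benchmark harness and a profiler should have the final word.

```java
class CopyOverheadTiming {
    /**
     * Times repeated substring extraction, optionally followed by new String(..).
     * The checksum keeps the loop body from being optimized away entirely.
     */
    static long timeNanos(String query, int iterations, boolean copy) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            String id = query.substring(14, 18);
            if (copy) {
                id = new String(id);
            }
            checksum += id.length();
        }
        long elapsed = System.nanoTime() - start;
        if (checksum != 4L * iterations) {
            throw new IllegalStateException("unexpected checksum");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String query = "select * from test where id = 1";
        int n = 1000000;
        // warm-up runs so the JIT has a chance to compile the loop
        timeNanos(query, n, false);
        timeNanos(query, n, true);
        System.out.println("substring only:      " + timeNanos(query, n, false) + " ns");
        System.out.println("substring + new String: " + timeNanos(query, n, true) + " ns");
    }
}
```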

Alexey Andreev
  • In my case the parser input is a String (it's the JDBC API). But I will measure the performance difference and post it. – Thomas Mueller Nov 28 '13 at 08:29
  • If you process lots of strings and only accumulate small parts of them (such as in the example, where only the identifier is extracted from a query), a lot of memory overhead is generated (identifier strings backed by the same char array as the source), so memory problems are quite plausible. – Peter Walser Nov 28 '13 at 08:40
  • When you process the input through a `Reader`, you never call `substring`, so there is no risk of a memory leak. In the above case I make two assumptions: 1. Strings are not **really** large 2. They are thrown away after processing a call. If both hold, we don't get any significant memory overhead. – Alexey Andreev Nov 28 '13 at 08:45
  • The input is a String (as I wrote, it is the JDBC API: I am parsing SQL statements). I will not create a `Reader` from a `String`, as that would be even less efficient than always using `new String(s)`. Queries are usually about 100 characters long, but sometimes many thousands. – Thomas Mueller Nov 28 '13 at 09:03