33

Before anyone questions the fact of using string.intern() at all, let me say that I need it in my particular application for memory and performance reasons. [1]

So, until now I used String.intern() and assumed it was the most efficient way to do it. However, I noticed since ages it is a bottleneck in the software. [2]

Then, just recently, I tried to replace the String.intern() by a huge map where I put/get the strings in order to obtain each time a unique instance. I expected this would be slower... but it was exactly the opposite! It was tremendously faster! Replacing the intern() by pushing/polling a map (which achieves exactly the same) resulted in more than one order of magnitude faster.

The question is: why is intern() so slow?!? Why isn't it then simply backed up by a map (or actually, just a customized set) and would be tremendously faster? I'm puzzled.

[1]: For the unconvinced ones: It is in natural language processing and has to process gigabytes of text, therefore needs to avoid many instances of a same string to avoid blowing up the memory and referential string comparison to be fast enough.

[2]: without it (normal strings) it is impossible, with it, this particular step remains the most computation intensive one

EDIT:

Due to the surprising interest in this post, here is some code to test it out:

http://pastebin.com/4CD8ac69

And the results of interning a bit more than 1 million strings:

  • HashMap: 4 seconds
  • String.intern(): 54 seconds

Due to avoid some warm-up / OS IO caching and stuff like this, the experiment was repeated by inverting the order of both benchmarks:

  • String.intern(): 69 seconds
  • HashMap: 3 seconds

As you see, the difference is very noticeable, more than tenfolds. (Using OpenJDK 1.6.0_22 64bits ...but using the sun one resulted in similar results I think)

dagnelies
  • 5,203
  • 5
  • 38
  • 56
  • Great question, can you give some metrics on how much slower `intern()` is? – Steve Kuo Aug 31 '11 at 21:11
  • How many string instances do you end up assinging to the map/interning? – John Vint Aug 31 '11 at 21:13
  • 5
    Are you comparing `intern` to a synchronized `Hashtable` or an unsynchronized `HashMap`? `intern` is synchronized, so that might be a large part of what you are seeing. – Chris Dodd Aug 31 '11 at 21:14
  • what kind of string manipulations? (e.g. substring extraction vs. deletion/editing vs. concatenations) You might want to look at ropes. I'm also curious how you get the strings in the first place: when you do readline() or whatever, are strings that are equal() different objects?! – Jason S Aug 31 '11 at 21:16
  • Is your implementation thread safe? Does it use weak references? – Laurence Gonsalves Aug 31 '11 at 21:19
  • @Chris: intern() is *not* synchronized, at least at the Java language level. – Michael Borgwardt Aug 31 '11 at 21:21
  • @Michael Borgwardt - You can imagine the only actual time it will be blocked would be on the adds, but for all reads are non-blocking – John Vint Aug 31 '11 at 21:41
  • For interning I'd recommend http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Interners.html – maaartinus Aug 31 '11 at 21:43
  • @Chris: using normal hash maps or synchronized hash maps did not make a noticeable difference. – dagnelies Sep 01 '11 at 08:37
  • your benchmark timings show intern being the faster of the two, but this contradicts much of your text. Is this correct? – Ron Sep 02 '11 at 17:33
  • @Ron: thanks for the remark, you are right, I inverted both! Let me correct this... – dagnelies Sep 02 '11 at 18:14
  • I don't really have an answer as to exactly why its so slow, but I have done a more detailed implementation of a performance test for one of my questions. http://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern – LordOfThePigs May 17 '12 at 01:37

4 Answers4

6

This article discusses the implementation of String.intern(). In Java 6 and 7, the implementation used a fixed size (1009) hashtable so as the number entries grew, the performance became O(n). The fixed size can be changed using -XX:StringTableSize=N. Apparently, in Java8 the default size is larger but issue remains.

Martin Serrano
  • 3,727
  • 1
  • 35
  • 48
4

Most likely reason for the performance difference: String.intern() is a native method, and calling a native method incurs massive overhead.

So why is it a native method? Probably because it uses the constant pool, which is a low-level VM construct.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
  • 1
    I would have thought object pooling at the JVM level would be faster. – Steve Kuo Aug 31 '11 at 23:35
  • @Steve: me too ...i thought everything that is "native" was made for performance reasons – dagnelies Sep 01 '11 at 08:39
  • @Steve: the actual pooling stuff might be faster, but the difference is probably irrelevant compared to the native call's overhead. – Michael Borgwardt Sep 01 '11 at 09:02
  • 6
    @Arnaud: Not at all. Using native method for performance reasons may have been warranted in the very early years of Java when JVMs were much, much slower. But that changed over 10 years ago. Nowadays pretty much the only reason to use JNI is to access functionality that's not part of the Java standard API. – Michael Borgwardt Sep 01 '11 at 09:06
  • 4
    I can tell you for sure that the reason it is slow is not because it's a native method. If that was the problem, you'd expect to pay the same performance penalty (for crossing the JVM-native barrier) no matter the size of the string pool. My benchmarks (http://stackoverflow.com/questions/10624232) show that the complexity of String.intern() is in O(n^2) where n is the number of strings in the pool. The real answer to the question is: The algorithm used by String.intern is bad and doesn't scale. – LordOfThePigs Jul 17 '12 at 14:13
  • 1
    @MichaelBorgwardt In this case I think the JNI is used to do magic on the otherwise static memory location of the characters. Native methods may change fields at will. I don't think intern() should be used for performance optimization at all; at best it is an optimization on memory use. Possibly maybe it could have an effect on `equals` over very long strings (why would you do that) otherwise the effect will be negligible. – Maarten Bodewes Aug 17 '12 at 16:52
  • 1
    Calling a native method does not "incur massive overhead". – Hot Licks Mar 19 '14 at 16:12
  • @HotLicks: that depends on what you consider "massive". There is an overhead, and it is not negligible: http://www.javamex.com/tutorials/jni/overhead.shtml – Michael Borgwardt Mar 19 '14 at 16:46
  • 4
    Perhaps you missed this: "native methods in the standard library don't necessarily go through the JNI". JNI is expensive. Invoking a native method that is part of the JDK is quite cheap. – Hot Licks Mar 19 '14 at 16:51
  • 1
    this answer is just a guess and it is wrong to boot – Martin Serrano Feb 27 '15 at 21:59
3

@Michael Borgwardt said this in a comment:

intern() is not synchronized, at least at the Java language level.

I think that you mean that the String.intern() method is not declared as synchronized in the sourcecode of the String class. And indeed, that is a true statement.

However:

  • Declaring intern() as synchronized would only lock the current String instance, because it is an instance method, not a static method. So they couldn't implement string pool synchronization that way.

  • If you step back and think about it, the string pool has to perform some kind of internal synchronization. If it didn't it would be unusable in a multi-threaded application, because there is simply no practical way for all code that uses the intern() method to do external synchronization.

So, the internal synchronization that the string pool performs could be a bottleneck in multi-threaded application that uses intern() heavily.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • 1
    Just for your information, I tested the code with normal and synchronized maps. The difference in performance was negligible. – dagnelies Sep 02 '11 at 13:37
  • @arnaud - yes. If there is no contention, on the Map then the cost of synchronizing is negligible. The performance issues arise when there is significant contention; i.e. lots of different threads trying to use the Map at the same time. – Stephen C Sep 02 '11 at 13:55
  • 1
    `intern()` is definitely synchronized, and that is likely a major factor in the performance difference. `intern()` must make absolutely sure that only one version of the String makes it into the interned pool, and that pool is being hit from all angles -- multiple threads, classes being loaded, native methods dealing with Strings. The synchronization problems are quite complex. – Hot Licks Mar 19 '14 at 16:16
1

I can't speak from any great experience with it, but from the String docs:

"When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the {@link #equals(Object)} method, then the string from the pool returned. Otherwise, this String object is added to the pool and a reference to this String object is returned."

When dealing with large numbers of objects, any solution involving hashing will outperform one that doesn't. I think you're just seeing the result of misusing a Java language feature. Interning isn't there to act as a Map of strings for your use. You should use a Map for that (or Set, as appropriate). The String table is for optimization at the language level, not the app level.

Ryan Stewart
  • 126,015
  • 21
  • 180
  • 199
  • 3
    but wouldn't the built-in intern method use a map to determine if the pool already contains the string? (and if not, seems like an obvious inefficiency that could be improved) – Jason S Aug 31 '11 at 21:14
  • 1
    "Interning isn't there to act as a Map of strings for your use." `String.intern` is public, so it arguably *is* there for your use. Perhaps the OP is using it incorrectly, but your answer doesn't really indicate what is incorrect about his usage. – Laurence Gonsalves Aug 31 '11 at 21:22
  • 2
    HashMap uses equals() as well, and I'l bet you $100 that intern() ultimately uses some sort of hashtable. – Michael Borgwardt Aug 31 '11 at 21:23
  • @Michael Borgwardt You should be more of a betting man :) Just looked at the .cpp in the hotspot vm and it surely uses some form of a hash table. – John Vint Aug 31 '11 at 21:25
  • @John: Why then do you think does intern() perform so bad? Have you found something in the source about that? – alexkelbo Dec 06 '13 at 10:28