45

Lots of people talk about the performance advantages of String.intern(), but I'm actually more interested in what the performance penalty may be.

My main concerns are:

  • Search cost: The time that intern() takes to figure out if the internable string exists in the constants pool. How does that cost scale with the number of strings in that pool?
  • Synchronization: obviously the constant pool is shared by the whole JVM. How does that pool behave when intern() is being called over and over from multiple threads? How much locking does it perform? How does the performance scale with contention?

I am concerned about all these things because I'm currently working on a financial application that has a problem of using too much memory because of duplicated Strings. Some strings basically look like enumerated values and can only have a limited number of potential values (such as currency names ("USD", "EUR")) exist in more than a million copies. String.intern() seems like a no-brainer in this case, but I'm worried about the synchronization overhead of calling intern() everytime I store a currency somewhere.

On top of that, some other types of strings can have millions of different values, but still have tens of thousands of copies of each (such as ISIN codes). For these, I'm concerned that interning a million string would basically slow down the intern() method so much as to bog down my application.

LordOfThePigs
  • 11,050
  • 7
  • 45
  • 69
  • 2
    Dup: http://stackoverflow.com/questions/5061611/search-cost-of-string-interning-and-declaration-of-literal-strings – skaffman May 16 '12 at 18:12
  • 1
    @skaffman There's no in-depth performance analysis there. – Marko Topolnik May 16 '12 at 18:14
  • 2
    @skaffman I saw that question you're linking to but it doesn't discuss perfomance scaling for the search cost, and it doesn't touch on the subject of synchronization. – LordOfThePigs May 16 '12 at 18:21

5 Answers5

40

I did a little bit of benchmarking myself. For the search cost part, I've decided to compare String.intern() with ConcurrentHashMap.putIfAbsent(s,s). Basically, those two methods do the same things, except String.intern() is a native method that stores and read from a SymbolTable that is managed directly in the JVM, and ConcurrentHashMap.putIfAbsent() is just a normal instance method.

You can find the benchmark code on github gist (for a lack of a better place to put it). You can also find the options I used when launching the JVM (to verify that the benchmark is not skewed) in the comments at the top of the source file.

Anyway here are the results:

Search cost (single threaded)

Legend

  • count: the number of distinct strings that we are trying to pool
  • initial intern: the time in ms it took to insert all the strings in the string pool
  • lookup same string: the time in ms it took to lookup each of the strings again from the pool, using exactly the same instance as was previously entered in the pool
  • lookup equal string: the time in ms it took to lookup each of the strings again from the pool, but using a different instance

String.intern()

count       initial intern   lookup same string  lookup equal string
1'000'000            40206                34698                35000
  400'000             5198                 4481                 4477
  200'000              955                  828                  803
  100'000              234                  215                  220
   80'000              110                   94                   99
   40'000               52                   30                   32
   20'000               20                   10                   13
   10'000                7                    5                    7

ConcurrentHashMap.putIfAbsent()

count       initial intern   lookup same string  lookup equal string
1'000'000              411                  246                  309
  800'000              352                  194                  229
  400'000              162                   95                  114
  200'000               78                   50                   55
  100'000               41                   28                   28
   80'000               31                   23                   22
   40'000               20                   14                   16
   20'000               12                    6                    7
   10'000                9                    5                    3

The conclusion for the search cost: String.intern() is surprisingly expensive to call. It scales extremely badly, in something of O(n) where n is the number of strings in the pool. When the number of strings in the pool grows, the amount of time to lookup one string from the pool grows much more (0.7 microsecond per lookup with 10'000 strings, 40 microseconds per lookup with 1'000'000 strings).

ConcurrentHashMap scales as expected, the number of strings in the pool has no impact on the speed of the lookup.

Based on this experiment, I'd strongly suggest avoiding to use String.intern() if you are going to intern more than a few strings.

Community
  • 1
  • 1
LordOfThePigs
  • 11,050
  • 7
  • 45
  • 69
  • I ended up not benchmarking the synchronization overhead. The single threaded performance scales so badly that it renders the synchronization point moot. – LordOfThePigs Jul 17 '12 at 14:05
  • Does https://blog.codecentric.de/en/2012/03/save-memory-by-using-string-intern-in-java/ not contradict you finding looking overall ? I had some confusions when solving http://stackoverflow.com/questions/20027495/how-use-of-operator-brings-in-performance-improvements-compared-to-equals/20027571?noredirect=1#comment29826233_20027571 – Harish Kayarohanam Nov 17 '13 at 09:22
  • @HarishKayarohanam I don't think it contradicts my findings. He is only measuring the memory advantages of interning. When it comes to memory efficiency String.intern() works as expected. In terms of CPU performance though, it behaves very differently from "basically a HashSet that stores the String in the Permanent Generation". The other answer from mik1 gives an idea of why it's that way. Anyway, using a HashSet that you manage yourself is more flexible and behaves more consistently without a need to tweak the VM options, so it's a better solution for re-using a lot of strings. – LordOfThePigs Nov 17 '13 at 23:30
  • 2
    Did you try benchmark with different sizes of string pool table: http://java-performance.info/string-intern-in-java-6-7-8/ – Martin May 03 '14 at 09:25
  • I did not because I wasn't aware of the fact that you could set the size of this table. Anyway, String.intern() has become better in the later java versions, but if possible I'd still rather use my own map. Using your own map has the extra advantage that you can control the life cycle of your instances and let them get garbage collected if necessary. – LordOfThePigs May 03 '14 at 12:15
  • 1
    What Java version this was measured on? – Tagar Aug 01 '18 at 17:58
  • 1
    @Tagar, it is described in the GitHub Gist [linked in the answer](https://gist.github.com/npiguet/2715360#file-internbench-java-L5): 1.7.0_03 – Marcono1234 Aug 02 '19 at 10:16
24

I have recently written an article about String.intern() implementation in Java 6, 7 and 8: String.intern in Java 6, 7 and 8 - string pooling.

There is a -XX:StringTableSize JVM parameter, which will allow you to make String.intern extremely useful in Java7+. So, unfortunately I have to say that this question is currently giving the misleading information to the readers.

mik1
  • 568
  • 4
  • 7
  • +1 for explaining the reasons why `String.intern` sucks, but the default implementation still seems bad enough in Java 7 that I'd stick with manual string pooling. – user2357112 Aug 25 '13 at 08:19
  • 5
    So just to make it clear for other readers, your article explains that the String.intern pools is basically a HashSet whose underlying HashMap has a fixed number of buckets that is decided by a VM command line option. However, that HashMap is kind of degenerated and doesn't know how to increase the number of buckets it contains. This is why String.intern performs close to O(1) for a few strings, but degrades to O(n) for a lot more strings. – LordOfThePigs Nov 17 '13 at 23:34
5

I've found it better to use a fastutil hash table and do my own interning rather than reuse String.intern(). Using my own hashtable means that I can make my own decisions about concurrency, and I'm not competing for PermGen space.

I did this because I was working on a problem that had, as it were, millions of strings, many identical, and I wanted to (a) reduce footprint and (b) allow comparison by identity. For my problem, things were better with interning than without, using my notString.intern() approach.

YMMV.

bmargulies
  • 97,814
  • 39
  • 186
  • 310
1

String.intern become slow is becuase two reasons:
1. the -XX:StringTableSize limitation.
In java,it uses a internal hashtable to manage string cache,in java 6,the default StringTableSize value is 1009,which means string.intern is O(the number of string object/ 1009),when more and more string object been created,it's becoming slower.

\openjdk7\hotspot\src\share\vm\classfile\symbolTable.cpp

oop StringTable::intern(Handle string_or_null, jchar* name,  
                        int len, TRAPS) {  
  unsigned int hashValue = java_lang_String::hash_string(name, len);  
  int index = the_table()->hash_to_index(hashValue);  
  oop string = the_table()->lookup(index, name, len, hashValue);  
  // Found  
  if (string != NULL) return string;  
  // Otherwise, add to symbol to table  
  return the_table()->basic_add(index, string_or_null, name, len,  
                                hashValue, CHECK_NULL);  
}

2. In java 6,the string cache pool is in the perm area,not in the heap,Most of the time,we config the perm size relatively small.

dingjsh
  • 367
  • 1
  • 2
  • 15
-2

The following micro benchmark suggests using an enum offers around a ten times performance improvement (the usual micro benchmark caveats apply) test code as follows:

public class Test {
   private enum E {
      E1;
      private static final Map<String, E> named = new HashMap<String, E>();
      static {
         for (E e : E.values()) {
            named.put( e.name(), e );
         }
      }

      private static E get(String s) {
         return named.get( s );
      }
   }

   public static void main(String... strings) {
      E e = E.get( "E1" ); // ensure map is initialised

      long start = System.nanoTime();
      testMap( 10000000 );
      long end = System.nanoTime();

      System.out.println( 1E-9 * (end - start) );
   }

   private static void testIntern(int num) {
      for (int i = 0; i < num; i++) {
         String s = "E1".intern();
      }
   }

   private static void testMap(int num) {
      for (int i = 0; i < num; i++) {
         E e = E.get( "E1" );
      }
   }
}

Results (10 million iterations): testIntern() - 0.8 seconds testMap() - 0.06 seconds

Of course YMMV, but enums offer so many benefits over Strings...type-safety over other random Strings, ability to add methods etc. seems the best way to go imho

Nathan Adams
  • 1,255
  • 1
  • 11
  • 17
  • yeah... Except you've interned only 1 distinct string... Nobody has performance problems interning one 1 distinct string. Not to mention that your micro benchmark is totally invalid to begin with, since the constant "E1" is not only already interned, but also inlined in the generated bytecode. When I say that I inline 1'000'000 string in my test. I mean 1'000'000 string that are different, and for which anyString.equals(anyOtherString) is false. So I'm not sure what your benchmark is trying to measure. The peformance of map.get with a single element in the map? – LordOfThePigs May 19 '14 at 13:52
  • Yeah note the "usual micro benchmark caveats apply" note at the beginning. My guess would be the difference comes from String.intern() needing to be thread-safe as it has to read and write to it's cache, while the Map implementation doesn't need to be thread-safe as it's read only. – Nathan Adams May 19 '14 at 14:40
  • Not really. the difference as measured by your microbenchmark, is that that you only ever put a single element in the map. I'm pretty sure intern would have a similar performance with just a single string in the pool. – LordOfThePigs May 19 '14 at 14:47
  • If the number of Strings interned from just that one class is enough to give a performance hit over a Map, the jvm is doing something seriously wrong, besides it would be proof a Map is better in this situation too. Anyway, the use case for the Map would be specifically for the currencies that were mentioned, and for any other "strings basically look like enumerated values and can only have a limited number of potential values", not all the million different Strings, so we may be talking at cross purposes somewhat in any case. – Nathan Adams May 19 '14 at 15:06