32

Why by default only literal strings are saved in the intern pool?

Example from MSDN:

String s1 = "MyTest";
String s2 = new StringBuilder().Append("My").Append("Test").ToString(); 
String s3 = String.Intern(s2); 
Console.WriteLine("s1 == '{0}'", s1);
Console.WriteLine("s2 == '{0}'", s2);
Console.WriteLine("s3 == '{0}'", s3);
Console.WriteLine("Is s2 the same reference as s1?: {0}", (Object)s2==(Object)s1); 
Console.WriteLine("Is s3 the same reference as s1?: {0}", (Object)s3==(Object)s1);

/*
This example produces the following results:
s1 == 'MyTest'
s2 == 'MyTest'
s3 == 'MyTest'
Is s2 the same reference as s1?: False
Is s3 the same reference as s1?: True
*/
gdoron
  • 147,333
  • 58
  • 291
  • 367
  • 4
    Why should every string be interned? – BoltClock Dec 14 '11 at 17:37
  • 3
    @BoltClock, Why should every literal string be interned? – gdoron Dec 14 '11 at 17:39
  • 2
    The main reason literals are interned is to save memory space that's otherwise wasted by repeating the same string many times throughout a program, not to mention it's laughably easy to do it. But I can't think of a reason for non-literal strings to be interned. – BoltClock Dec 14 '11 at 17:40

3 Answers3

62

The short answer: interning literal strings is cheap at runtime and saves memory. Interning non-literal strings is expensive at runtime and therefore saves a tiny amount of memory in exchange for making the common cases much slower.

The cost of the interning-strings-at-runtime "optimization" does not pay for the benefit, and is therefore not actually an optimization. The cost of interning literal strings is cheap and therefore does pay for the benefit.

I answer your question in more detail here:

https://ericlippert.com/2009/09/28/string-interning-and-string-empty

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
  • Thanks! excellent blog you got over there... =) Maybe you can answer a [related question](http://stackoverflow.com/questions/8485515/meaning-of-confusing-comment-above-string-empty-in-net-bcl-source) about string.Empty? – gdoron Dec 14 '11 at 18:04
  • 2
    Does interning literal strings have *ANY* run-time cost? By my understanding, if six assemblies each use the literal string "Quack" at least once, then six instances of the string "Quack" will be created--one for each assembly, without the run-time bothering to examine their contents. Because of Reflection, .net must generally assume that it will always be possible for any function to be executed and thus for any literal string to be accessed, so the lifetime of a literal string will be the lifetime of the assembly where it appears. I'd consider "cheap at runtime" to be an understatement. – supercat Dec 14 '11 at 18:44
  • 2
    @supercat: Your understanding is correct; literal string interning is done on a per-assembly basis. As for your question about the lifetime of such a string, I do not know the answer off the top of my head. The runtime could I suppose deallocate such a string when there are no live references to it, and reconstitute it later; how would you be able to tell that the reference was different? But it could just as easily keep it alive forever. – Eric Lippert Dec 14 '11 at 18:49
  • 2
    @gdoron: I actually don't know the answer to that question; I have wondered the same thing myself. I'm having lunch with some of the CLR old timers today; I'll see if any of them remember. And I'm glad you like the blog. – Eric Lippert Dec 14 '11 at 18:51
  • Wait, so if one assembly has "Quack" and another assembly has "Quack", they will have two strings that are interned, and have identical contents, but will be different references? I thought the entire point of interning was so that identical contents will be identical references... – configurator Dec 17 '11 at 13:50
  • @configurator: In the case of literals we only guarantee reference identity of usages of that literal in the same assembly. Interning literals across assemblies has a runtime performance cost; if you want to take on that expense then you have to write the code to do it yourself. – Eric Lippert Dec 17 '11 at 16:43
  • 1
    Oh dear. I think there's a potential bug in my compiler than - I'm using a `Symbol` struct which is simply an immutable wrapper around `string` that calls `Intern` and keeps the result - and uses reference comparisons instead of string comparisons. Does calling `Intern` manually on two strings pre-interned differently across assemblies keep two different copies? – configurator Dec 22 '11 at 18:11
  • @configurator: I have no idea. What an interesting question! I think you'll have to try it and see. – Eric Lippert Dec 22 '11 at 18:14
  • After some testing: fortunately for me, this doesn't happen in VS 2010 - strings are interned as equal across assembly boundaries. However, that means I can't check the answer to my question either. – configurator Dec 22 '11 at 18:37
  • That [related question](http://stackoverflow.com/questions/8485515/meaning-of-confusing-comment-above-string-empty-in-net-bcl-source) got an answer. Is it the right answer? – gdoron Feb 08 '12 at 10:17
  • Great answer. I'm so unhappy since you aren't in C# team anymore. – Johnny_D Apr 24 '13 at 15:19
  • 1
    The edit queue is full, so I'll comment instead. Here's the non-dead version of the link to Eric's blog post (linked in this answer): https://ericlippert.com/2009/09/28/string-interning-and-string-empty/ – William Nov 21 '22 at 19:34
23

The language designers decided the cost of interning every intermediate string value was not worth the performance cost. Interning of garbage-collectible strings requires a single global weak map which can become a bottleneck when you have large numbers of threads.

Mike Samuel
  • 118,113
  • 30
  • 216
  • 245
  • @EricLippert: Interning strings wouldn't require a global weak map. If each string held an initially-blank reference to a string with the same content which ranked higher on some metric (perhaps age), then strings could be compared by chasing down the references to the highest-ranking equivalent strings. Two strings with the same "best" equivalent could be considered equal without having to examine their contents; otherwise, if two strings were examined and found to be equal, one of the "equivalent" strings could be set to be an equivalent of the other. – supercat Jan 01 '15 at 18:20
  • @EricLippert: For such a design to avoid memory leaks it would probably be necessary to have GC support (so that if a string has a reference to a better string, any "better string" reference which identifies it would be changed to identify the better one) but such a design could in many usage scenarios offer noticeable benefit without needing *any* kind of hash map, and could work quite nicely with per-thread hash maps (note that if `x` and `y` are found to be equal, and `a` and `b` likewise, finding `x` equal to `a` would imply it was equal to `b`, and that `y` was equal to both `a` and `b`. – supercat Jan 01 '15 at 18:22
  • @EricLippert: Consequently, if each thread had its own hash map, two threads which both created the string "Fred" would start out holding separate references, but both would get interned to the same backing store as soon as any "Fred" from one thread was compared to any "Fred" from the other. – supercat Jan 01 '15 at 18:23
  • @supercat, Interning has to be explicit because the Java language requires that `new X()` is distinct from all references that existed prior, and string has a public constructor. – Mike Samuel Jan 02 '15 at 17:08
  • @MikeSamuel: Interning of *explicitly exposed* references must be explicit, but objects hold references to backing stores which are *neither* modified nor exposed to outside code, those backing stores could be interned implicitly. For example, in Java given `String A="foo", B=new String(A);`, `B` would be a new `String` object, but both `A` and `B` would hold references to the same `char[]`. Interestingly, the implementation of `String.equals()` I looked at (I think v7) doesn't check for reference equality of the `char[]` instances even though two distinct string objects could... – supercat Jan 03 '15 at 17:28
  • ...be bound to the same backing store. I would guess that having `String.equals` merge `char[]` arrays for long strings that turned out to be equal could be a performance win if the memory model could support it, were it not for the existence of code that uses Reflection or native methods to manipulate the backing stores of `String` objects. If it had always been recognized that a `String` object could legitimately change among equivalent backing stores and thus modifying the `char[]` instances was 100% illegitimate, that could be a useful optimization. – supercat Jan 03 '15 at 17:31
  • @supercat, Yeah. I think you understand the reflection problems; it's unfortunate, but all backing stores are explicitly visible and mutable in the presence of `java.lang.reflect.Field.setAccessible`. If this topic of string interning interests you, you might look at attempts to address the umbrella group of problems: [value types for Java](https://blogs.oracle.com/jrose/entry/value_types_in_the_vm). – Mike Samuel Jan 03 '15 at 18:10
  • @MikeSamuel: I really wish Java had made `string` behave as a primitive type, most likely implemented as a reference which could only be dereferenced using native methods. To be useful, it would need to allow member-style method invocation, something other primitives don't, but IMHO all primitives should have allowed that anyway. IMHO, `someLong = someFloat.roundLong()` would be much cleaner and less trouble-prone than `someLong = Math.round((double)someFloat);`. – supercat Jan 03 '15 at 18:27
  • @MikeSamuel: Having `string` be a primitive type would have allowed the `==` operator to consistently test equality, and would also have allowed improved performance in many cases. At present, accessing a character from a string requires two levels of indirection; if a `string` variable could either hold a `char[]`, a `byte[]` or some other representation, then in the case where it held a `byte[]` or `char[]` accessing a character would require two reads from the same base and two "constant" compare-to-constant operations. – supercat Jan 03 '15 at 18:32
3

Interning strings would provide almost no benefit in most string usage scenarios, even if one had a zero-cost weak-reference interning pool (the ideal interning implementation). In order for string interning to offer any benefit, it is necessary that multiple references to coincidentally-equal strings be kept for a reasonably "long" time.

Consider the following two programs:

  1. Input 100,000 lines from a text file, each containing some arbitrary text, and then 100,000 five-digit numbers. Regard each number read in as a zero-based index into the list of 100,000 lines that were read in, and output the corresponding line to the output.
  2. Input 100,000 lines from a text file, outputing every line that contains the character sequence "fnord".

For the first program, depending upon the contents of the text file, string interning might generate almost a 50,000:1 savings in memory (if the line contained 100,000 identical long lines of text) or might represent a total waste (if all 100,000 lines are different). In the absence of string interning, an input file with 100,000 identical lines would cause 100,000 live instances of the same string to exist simultaneously. With string interning, the number of live instances could be reduced to two. Of course, there's no way a compiler can even try to guess whether the input file is apt to contain 100,000 identical lines, 100,000 different lines, or something in-between.

For the second program, it's unlikely that even an ideal string-interning implementation would offer much benefit. Even if all 100,000 lines of the input file happened to be identical, interning couldn't save much memory. The effect of interning isn't to prevent the creation of redundant string instances, but rather to allow redundant string instances to be identified and discarded. Since each line can be discarded once it has been examined and either output or not, the only thing interning could buy would be the (theoretical) ability to discard redundant string instances (very) slightly sooner than would otherwise be possible.

There may be benefits in some cases to caching certain 'intermediate' string results, but that's a task that's really best left to the programmer. For example, I have a program which needs to convert a lot of bytes to two-digit hex strings. To facilitate that, I have an array of 255 strings which hold the string equivalents of values from 00 to FF. I know that, on average, each string in that array will be used, at minimum, hundreds or thousands of times, so caching those strings is a huge win. On the other hand, the strings can only be cached because I know what they represent. I may know that, for any n 0-255, String.Format("{0:X2}",n) will always yield the same value, but I wouldn't expect a compiler to know that.

supercat
  • 77,689
  • 9
  • 166
  • 211