Interning strings would provide almost no benefit in most string usage scenarios, even if one had a zero-cost weak-reference interning pool (the ideal interning implementation). In order for string interning to offer any benefit, it is necessary that multiple references to coincidentally-equal strings be kept for a reasonably "long" time.
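To make that ideal concrete, here is a rough sketch of a weak-reference interning pool in C#. This is my own toy code, not anything the framework provides: it folds equal strings into one canonical instance while holding no strong reference of its own, and even so it is hardly zero-cost:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical weak-reference interning pool. Unlike string.Intern, which
// keeps interned strings alive for the life of the runtime, this pool holds
// only WeakReferences, so canonical strings can still be collected.
static class WeakInternPool
{
    // Keyed by hash code rather than by the string itself, so that the
    // dictionary never holds a strong reference to any interned string.
    private static readonly Dictionary<int, List<WeakReference<string>>> buckets =
        new Dictionary<int, List<WeakReference<string>>>();

    public static string Intern(string s)
    {
        lock (buckets)
        {
            int h = s.GetHashCode();
            if (!buckets.TryGetValue(h, out var bucket))
                buckets[h] = bucket = new List<WeakReference<string>>();

            bucket.RemoveAll(w => !w.TryGetTarget(out _)); // prune dead entries

            foreach (var w in bucket)
                if (w.TryGetTarget(out var existing) && existing == s)
                    return existing; // reuse the canonical instance

            bucket.Add(new WeakReference<string>(s));
            return s; // s becomes the canonical instance
        }
    }
}
```

Even this "ideal" pool pays for a lock, a hash lookup, and a bucket scan on every call, which is part of why interning everything by default is a poor trade.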
Consider the following two programs:
- Input 100,000 lines from a text file, each containing some arbitrary text, followed by 100,000 five-digit numbers. Treat each number as a zero-based index into the list of 100,000 lines and output the corresponding line.
- Input 100,000 lines from a text file, outputting every line that contains the character sequence "fnord".
For the first program, depending upon the contents of the text file, string interning might yield almost a 50,000:1 savings in memory (if the file contained 100,000 identical long lines of text) or might represent a total waste (if all 100,000 lines are different). In the absence of string interning, an input file with 100,000 identical lines would cause 100,000 live instances of the same string to exist simultaneously. With string interning, the number of live instances could be reduced to two. Of course, there's no way a compiler can even try to guess whether the input file is apt to contain 100,000 identical lines, 100,000 different lines, or something in-between.
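For illustration, the first program might look something like this (the file name and exact layout are my assumptions, and error handling is omitted). The `string.Intern` call is what collapses duplicate lines, though note that the real .NET intern pool, unlike the ideal weak pool sketched above, keeps its strings alive for the life of the process:

```csharp
using System;
using System.IO;

class IndexedLinePrinter
{
    static void Main()
    {
        var lines = new string[100000];
        using (var reader = new StreamReader("input.txt")) // hypothetical file
        {
            // 100,000 identical lines intern down to one pooled instance
            // (plus the transient copy currently being read).
            for (int i = 0; i < lines.Length; i++)
                lines[i] = string.Intern(reader.ReadLine());

            // Then 100,000 five-digit numbers, each an index into 'lines'.
            for (int i = 0; i < 100000; i++)
            {
                int index = int.Parse(reader.ReadLine());
                Console.WriteLine(lines[index]);
            }
        }
    }
}
```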
For the second program, it's unlikely that even an ideal string-interning implementation would offer much benefit. Even if all 100,000 lines of the input file happened to be identical, interning couldn't save much memory. The effect of interning isn't to prevent the creation of redundant string instances, but rather to allow redundant string instances to be identified and discarded. Since each line can be discarded once it has been examined and either output or not, the only thing interning could buy would be the (theoretical) ability to discard redundant string instances (very) slightly sooner than would otherwise be possible.
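A sketch of the second program (again with a made-up file name) shows why: each line becomes garbage the moment the `Contains` test finishes, so there is never anything long-lived for an interning pool to deduplicate:

```csharp
using System;
using System.IO;

class FnordFilter
{
    static void Main()
    {
        using (var reader = new StreamReader("input.txt")) // hypothetical file
        {
            string line;
            // Only one line is ever live at a time; interning the lines
            // would add work without reducing peak memory.
            while ((line = reader.ReadLine()) != null)
                if (line.Contains("fnord"))
                    Console.WriteLine(line);
        }
    }
}
```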
There may be benefits in some cases to caching certain 'intermediate' string results, but that's a task best left to the programmer. For example, I have a program which needs to convert a lot of bytes to two-digit hex strings. To facilitate that, I keep an array of 256 strings holding the string equivalents of the values 00 to FF. I know that, on average, each string in that array will be used hundreds or thousands of times, so caching those strings is a huge win. On the other hand, those strings can only be cached because I know what they represent. I may know that, for any n from 0 to 255, `String.Format("{0:X2}", n)` will always yield the same value, but I wouldn't expect a compiler to know that.
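Such a cache might look like this (the class and method names here are hypothetical, not taken from my actual program):

```csharp
static class HexCache
{
    // 256 precomputed two-digit hex strings, one per possible byte value.
    private static readonly string[] hex = BuildTable();

    private static string[] BuildTable()
    {
        var table = new string[256];
        for (int n = 0; n < 256; n++)
            table[n] = n.ToString("X2"); // same text String.Format("{0:X2}", n) yields
        return table;
    }

    // Every caller converting the same byte value gets the same cached instance.
    public static string ToHex(byte b) => hex[b];
}
```

The programmer can build this table because the mapping from byte values to hex strings is known up front; a general-purpose interning mechanism has no such knowledge.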