Why doesn't string.Substring share memory with the source string?

Question

As we all know, strings in .NET are immutable. (Well, not 100% totally immutable, but immutable by design and used as such by any reasonable person, anyway.)

This makes it basically OK that, for example, the following code just stores a reference to the same string in two variables:

string x = "shark";
string y = x.Substring(0);

// Proof (requires an unsafe context; compile with /unsafe):
unsafe
{
    fixed (char* c = y)
    {
        c[4] = 'p';
    }
}

Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

sharp
sharp

Clearly x and y refer to the same string object. So here's my question: why wouldn't Substring always share state with the source string? A string is essentially a char* pointer with a length, right? So it seems to me the following should at least in theory be allowed to allocate a single block of memory to hold 5 characters, with two variables simply pointing to different locations within that (immutable) block:

string x = "shark";
string y = x.Substring(1);

// Does c[0] point to the same location as x[1]?
// (Requires an unsafe context.)
unsafe
{
    fixed (char* c = y)
    {
        c[0] = 'p';
    }
}

// Apparently not...
Console.WriteLine(x);
Console.WriteLine(y);

The above outputs:

shark
park

From the Substring documentation: "This method does not modify the value of the current instance. Instead, it returns a new string that begins at the startIndex position in the current string." I would say that it should never behave as in your first example. If you use Substring, you should expect it to create a different instance for further modification. — Piotr Auguscik, Jun 08 '11 at 05:30
Just to ask...do you really expect *anything* to work when you're sneaking around class invariants? — cHao, Jun 08 '11 at 05:31
Related: http://msdn.microsoft.com/en-us/library/system.string.intern.aspx — Andrew Savinykh, Jun 08 '11 at 05:44
Why doesn't the .net framework store all permutations of the alphabet in memory and we just reference a pointer to the part we need? :-) — benPearce, Jun 08 '11 at 05:53
@benPearce: Ha, are you implying my question is absurd? I really thought it was a reasonable thing to ask... — Dan Tao, Jun 08 '11 at 06:01
@Dan: No, it was simply a joke! But makes sense if you take your points to a ridiculous extreme. I upvoted the question because I thought it was good. — benPearce, Jun 08 '11 at 06:04

score 26 · Accepted Answer · answered Jun 08 '11 at 05:30

For two reasons:

The string metadata (e.g. length) is stored in the same memory block as the characters. To allow one string to use part of the character data of another string, you would have to allocate two memory blocks for most strings instead of one. As most strings are not substrings of other strings, that extra allocation would cost more memory than you could gain by reusing parts of strings.
There is an extra NUL character stored after the last character of the string, to make the string also usable by system functions that expect a null terminated string. You can't put that extra NUL character after a substring that is part of another string.

I suspected there would be some very good reasons for this; and sure enough, there are! Thanks for the insight. — Dan Tao, Jun 08 '11 at 05:41

score 11 · Answer 2 · answered Jun 08 '11 at 05:32

I believe C# strings are null terminated - while this is an implementation detail that shouldn't concern managed consumers, there are some cases (e.g. marshaling) where it's important.

Also, if a substring shared a buffer with a much longer string, a reference to the short substring would prevent the longer string from being collected. It would also create the possibility of a rat's nest of string references that all refer to the same buffer.

Joe

This was also a great answer; thanks! Makes perfect sense after considering those points. – Dan Tao Jun 08 '11 at 05:43
C# strings are NOT null terminated and it's very easy to prove that. `"abc\0def".Length` is `7` and not `3` (what it would be if they were null terminated) – wischi Aug 10 '17 at 12:52
@wischi - What I meant by "null terminated" is that I think there is a null ('\0') character following the string's characters in the underlying memory buffer. Not that it is "null terminated" in the classic C sense, i.e. the string is terminated by the first null character it contains in its buffer. Guffa's answer says the same thing, but more clearly, and is rightly the accepted answer. – Joe Aug 10 '17 at 14:28

score 6 · Answer 3 · answered Jun 09 '11 at 09:05

To add to the other answers:

Apparently, the Java standard classes do this: The string returned by String.substring() reuses the internal character array of the original string (source, or look at the JDK sources by Sun).

The problem is that this means that the original String cannot be GCed until all the substrings are eligible for GC as well (as they share the backing character array). This can lead to wasted memory if you start out with a large string, and extract some smaller strings out of it, then discard the big string. That would be common when parsing an input file, for example.

Of course, a clever GC might work around this by copying the character array when it is worth it (the Sun JVM may do this, I don't know), but the added complexity might be a reason not to implement this sharing behaviour at all.
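The retention problem can be sketched in C#. This is a minimal illustration only: `SharedSubstring` and its `RetainedChars` property are hypothetical names invented for this example, not part of .NET or the JDK. The point is that a "borrowing" substring must hold a reference to the entire source, so the source stays reachable for as long as the slice does.

```csharp
using System;

// Hypothetical type (not part of .NET): a substring that borrows the
// characters of its source instead of copying them.
public sealed class SharedSubstring
{
    private readonly string _source; // keeps the WHOLE source string reachable
    private readonly int _offset;

    public int Length { get; }

    public SharedSubstring(string source, int offset, int length)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (offset < 0 || length < 0 || offset > source.Length - length)
            throw new ArgumentOutOfRangeException();
        _source = source;
        _offset = offset;
        Length = length;
    }

    public char this[int index]
    {
        get
        {
            if (index < 0 || index >= Length) throw new IndexOutOfRangeException();
            return _source[_offset + index];
        }
    }

    // Number of characters kept alive on behalf of this slice.
    public int RetainedChars => _source.Length;

    // Copies the slice out into an independent string.
    public override string ToString() => _source.Substring(_offset, Length);
}

public static class SharedSubstringDemo
{
    public static void Main()
    {
        string big = new string('x', 1000000) + "needle";
        var small = new SharedSubstring(big, 1000000, 6);
        big = null; // the large string is still reachable through 'small'

        Console.WriteLine(small);               // the six borrowed characters
        Console.WriteLine(small.RetainedChars); // size of the buffer kept alive
    }
}
```

With a copying Substring, dropping the reference to the big string would let it be collected immediately; with the sharing design sketched here, six useful characters pin about a million.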

+1 to avoiding the added complexity. This is something that's been on my mind a lot lately: I think in many cases I prefer the "dumb, obvious" solution over clever, less easily provable ideas, more so than I used to. — Dan Tao, Jun 09 '11 at 14:57
@Dan Tao: Yes, just my thoughts. "Clever" is often something bad when programming. — sleske, Jun 09 '11 at 15:57

score 1 · Answer 4 · answered Jul 27 '11 at 01:48

There are a number of ways something like String could be implemented:

Have a "String" object effectively contain an array, with the implication that all characters in the array are in the string. This is what .NET actually does.
Have every "String" be a class which contains an array reference along with a starting offset and length. Problem: Creating most strings would require instantiating two objects rather than one.
Have every "String" be a structure which contains an array reference along with a starting offset and length. Problem: Assignments to string type fields would no longer be atomic.
Have two or more types of "String" objects--those which contain all the characters in an array, and those which contain a reference to another string along with an offset and length. Problem: This would require many methods of string to be virtual.
Have every "String" be a special class which includes a starting offset and length, an object reference to what may or may not be the same object, and a built-in array of characters. This would waste a little space in the common case where a string contains its own characters (because all of them would carry the extra fields), but would allow the same code to work with strings that contain their own characters or strings that 'borrow' from others.
Have a general-purpose ImmutableArray<T> type (which would inherit ReadableArray<T>), and have an ImmutableArray<Char> be interchangeable with String. There are many uses for immutable arrays; String is probably the most common usage case, but hardly the only one.
Have a general-purpose ImmutableArray<T> type as above, but also an ImmutableArraySegment<T> class, both inheriting from ImmutableArrayBase<T>. This would require many methods to be virtual, and would probably be my favorite possibility.

Note that most of these approaches have significant limitations in at least some usage scenarios.
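For comparison, later versions of .NET (assuming .NET Core 2.1 or newer, or the System.Memory package) offer something close to the "structure with a reference, offset and length" option without changing String itself: `ReadOnlyMemory<char>` and `ReadOnlySpan<char>`. The sketch below assumes those APIs are available; `MemorySliceDemo` and `BackingString` are names made up for this example.

```csharp
using System;
using System.Runtime.InteropServices;

public static class MemorySliceDemo
{
    // Returns the string object that backs the slice, or null if the
    // memory is not backed by a string.
    public static string BackingString(ReadOnlyMemory<char> slice)
    {
        return MemoryMarshal.TryGetString(slice, out string text, out _, out _)
            ? text
            : null;
    }

    public static void Main()
    {
        string x = "shark";

        // A view over x starting at index 1; no characters are copied here.
        ReadOnlyMemory<char> y = x.AsMemory(1);

        // ToString() on ReadOnlyMemory<char> materializes the viewed characters.
        Console.WriteLine(y.ToString());

        // The view still points at the original string object.
        Console.WriteLine(ReferenceEquals(BackingString(y), x));
    }
}
```

The trade-offs listed above still apply: the view keeps the whole source string alive, and because it is a multi-field struct, assigning it is not atomic the way assigning a string reference is.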

score 0 · Answer 5 · answered Jun 08 '11 at 05:34

I believe these are CLR optimisations that have nothing to do with programmers as you shouldn't be doing the things you are doing. You should assume it is a new string every time (as a programmer).

BobTurbo

Well, sure... I never said anything about *should*. I'm just curious, from a technical standpoint, why this decision was made. I think Guffa and Joe have given some great reasons. – Dan Tao Jun 08 '11 at 05:39
You are right that these are details that you shouldn't normally bother yourself with. However, there is still value in discussing how the internals of the language are constructed, for the sake of gaining a better understanding of how it's meant to be used, so that you can avoid things that are inherently inefficient. – Guffa Jun 09 '11 at 06:31

score 0 · Answer 6 · answered Jun 08 '11 at 05:47

After reviewing the Substring method with Reflector, I found that if you pass 0 as the start index (so that the requested length covers the whole string), it returns the same object.

[SecurityCritical]
private unsafe string InternalSubString(int startIndex, int length, bool fAlwaysCopy)
{
    if (((startIndex == 0) && (length == this.Length)) && !fAlwaysCopy)
    {
        return this;
    }
    string str = FastAllocateString(length);
    fixed (char* chRef = &str.m_firstChar)
    {
        fixed (char* chRef2 = &this.m_firstChar)
        {
            wstrcpy(chRef, chRef2 + startIndex, length);
        }
    }
    return str;
}
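
The same behaviour can be observed without unsafe code by comparing references. This is a small sketch; `SubstringIdentityDemo` and `ReturnsSameObject` are names invented here, and the result reflects an implementation detail of the runtime rather than anything the documentation promises.

```csharp
using System;

public static class SubstringIdentityDemo
{
    // True when Substring(startIndex) hands back the very same object
    // as the source, rather than a freshly allocated copy.
    public static bool ReturnsSameObject(string source, int startIndex)
    {
        return ReferenceEquals(source, source.Substring(startIndex));
    }

    public static void Main()
    {
        string x = "shark";

        // Whole string requested: the fast path shown above returns 'this'.
        Console.WriteLine(ReturnsSameObject(x, 0));

        // Proper substring: a new string is allocated and the chars copied.
        Console.WriteLine(ReturnsSameObject(x, 1));
    }
}
```

Going by the decompiled method above, the first call should report True and the second False.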

vityanya

Yeah... this is basically what I was trying to show with my first example. The question is why when you pass a *non-zero* value, the `string` object returned does not share the same `char` values in memory with the original. – Dan Tao Jun 08 '11 at 05:49
maybe this link can help http://stackoverflow.com/questions/636932/in-c-why-is-string-a-reference-type-that-behaves-like-a-value-type – vityanya Jun 08 '11 at 05:57

score 0 · Answer 7 · answered Jun 08 '11 at 06:31

This would add complexity (or at least more smarts) to the intern table. Imagine you already have two entries in the intern table "pending" and "bending" and the following code:

var x = "pending";
var y = x.Substring(1);

Which entry in the intern table would be considered a hit?

Stuart

Neither. Strings created at runtime are not automatically interned. – Guffa Jun 08 '11 at 08:15
