Possibility of COW std::string implementation in C++11

Question

Today I passed by this SO question: Legality of COW std::string implementation in C++11

The most voted answer (35 upvotes) for that question says:

It's not allowed, because as per the standard 21.4.1 p6, invalidation of iterators/references is only allowed for

— as an argument to any standard library function taking a reference to non-const basic_string as an argument.

— Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

For a COW string, calling non-const operator[] would require making a copy (and invalidating references), which is disallowed by the paragraph above. Hence, it's no longer legal to have a COW string in C++11.

I wonder whether that justification is valid or not because it seems C++03 has similar requirements for string iterator invalidation:

References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

As an argument to non-member functions swap() (21.3.7.8), operator>>() (21.3.7.9), and getline() (21.3.7.9).

As an argument to basic_string::swap().

Calling data() and c_str() member functions.

Calling non-const member functions, except operator[](), at(), begin(), rbegin(), end(), and rend().

Subsequent to any of the above uses except the forms of insert() and erase() which return iterators, the first call to non-const member functions operator[](), at(), begin(), rbegin(), end(), or rend().

These are not exactly the same as those of C++11's, but at least the same for the part of operator[](), which the original answer took as the major justification. So I guess, in order to justify the illegality of COW std::string implementation in C++11, some other standard requirements need to be cited. Help needed here.

That SO question has been inactive for over a year, so I've decided to raise this as a separate question. Please let me know if this is inappropriate and I will find some other way to clear my doubt.

Also relevant: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2008/n2534.html — dyp, Mar 03 '14 at 13:36
Even more so: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2668.htm — pmr, Mar 03 '14 at 13:38
Well the note below the quoted paragraph in C++03 states that the rules *intend* to allow a COW implementation.. I think the last bullet point also makes an exception *for the first call to non-member `operator[]`*, but I somehow find the formulation quite unclear. — dyp, Mar 03 '14 at 13:43
@dyp: Yes, there are indeed differences in these invalidation requirements between 03 and 11, and those differences implies an intention to make cow possible in 03 and impossible in 11, but so far I can't figure out the logic chain, at least not with what is said in the original answer of the linked SO question. I'll try to read the open-std documents that you and pmr provided to see if I could get some thoughts. — goodbyeera, Mar 03 '14 at 13:48
I *think* (but cannot prove currently) that C++03 allows the first call to `operator[]` to copy the string and invalidate any references/iterators that referred into the string on which `operator[]` has been called. — dyp, Mar 03 '14 at 13:50
Last quoted bullet point allows COW. It essentially says that after operation that invalidates references a subsequent first mutating operation is allowed to invalidate references as well. It allows deferring invalidation until string is actually modified. — zch, Mar 03 '14 at 13:50
@dyp That's the intent, at any rate. But you're right, the actual text is far from clear. (For a full explication of what's behind the C++03 wording, see the French national body comments to the CD2 of C++98, and the ensuing discussions.) — James Kanze, Mar 03 '14 at 13:54
It's interesting that while you could easily implement an operator that works even with COW (using pointer to string + offset), the reference problem cannot be solved. If only `std::string` was following Scott Meyers' advice (Item 28 of Effective C++): [Avoid returning handles to object internals](http://stackoverflow.com/questions/13176751/avoid-returning-handles-to-object-internals-so-whats-the-alternative)... — Matthieu M., Mar 03 '14 at 14:32
@MatthieuM. The reference problem can be solved; both g++ and older Sun CC used COW. The issues are 1) it doesn't buy you as much as it should, because of the leaked handles (you have to disable copy once a handle has been leaked), and 2) it is very difficult to implement efficiently in a multithreaded environment -- there's a bug in the g++ implementation, and the Sun CC implementation is (or at least was) horribly slow. — James Kanze, Mar 03 '14 at 18:54
@JamesKanze: the issue of leaked handled is why I say "not solved". If we were using a `get(size_t)`/`set(size_t, char)` approach there would be no leaked handle (and no false positive). — Matthieu M., Mar 04 '14 at 07:43

James Kanze · Accepted Answer · 2014-03-03T18:50:01.623

27

The key point is the last point in the C++03 standard. The wording could be a lot clearer, but the intent is that the first call to [], at, etc. (but only the first call) after something which established new iterators (and thus invalidated old ones) could invalidate iterators, but only the first. The wording in C++03 was, in fact, a quick hack, inserted in response to comments by the French national body on the CD2 of C++98. The original problem is simple: consider:

std::string a( "some text" );
std::string b( a );
char& rc = a[2];

At this point, modifications through rc must affect a, but not b. If COW is being used, however, when a[2] is called, a and b share a representation; in order for writes through the returned reference not to affect b, a[2] must be considered a "write", and be allowed to invalidate the reference. Which is what CD2 said: any call to a non-const [], at, or one of the begin or end functions could invalidate iterators and references. The French national body comments pointed out that this rendered a[i] == a[j] invalid, since the reference returned by one of the [] would be invalidated by the other. The last point you cite of C++03 was added to circumvent this—only the first call to [] et al. could invalidate the iterators.

I don't think anyone was totally happy with the results. The wording was done quickly, and while the intent was clear to those who were aware of the history, and the original problem, I don't think it was fully clear from standard. In addition, some experts began to question the value of COW to begin with, given the relative impossibility of the string class itself to reliably detect all writes. (If a[i] == a[j] is the complete expression, there is no write. But the string class itself must assume that the return value of a[i] may result in a write.) And in a multi-threaded environment, the cost of managing the reference count needed for copy on write was deemed a relatively high cost for something you usually don't need. The result is that most implementations (which supported threading long before C++11) have been moving away from COW anyway; as far as I know, the only major implementation still using COW was g++ (but there was a known bug in their multithreaded implementation) and (maybe) Sun CC (which the last time I looked at it, was inordinately slow, because of the cost of managing the counter). I think the committee simply took what seemed to them the simplest way of cleaning things up, by forbidding COW.

EDIT:

Some more clarification with regards to why a COW implementation has to invalidate iterators on the first call to []. Consider a naïve implementation of COW. (I will just call it String, and ignore all of the issues involving traits and allocators, which aren't really relevant here. I'll also ignore exception and thread safety, just to make things as simple as possible.)

class String
{
    struct StringRep
    {
        int useCount;
        size_t size;
        char* data;
        StringRep( char const* text, size_t size )
            : useCount( 1 )
            , size( size )
            , data( ::operator new( size + 1 ) )
        {
            std::memcpy( data, text, size ):
            data[size] = '\0';
        }
        ~StringRep()
        {
            ::operator delete( data );
        }
    };

    StringRep* myRep;
public:
    String( char const* initial_text )
        : myRep( new StringRep( initial_text, strlen( initial_text ) ) )
    {
    }
    String( String const& other )
        : myRep( other.myRep )
    {
        ++ myRep->useCount;
    }
    ~String()
    {
        -- myRep->useCount;
        if ( myRep->useCount == 0 ) {
            delete myRep;
        }
    }
    char& operator[]( size_t index )
    {
        return myRep->data[index];
    }
};

Now imagine what happens if I write:

String a( "some text" );
String b( a );
a[4] = '-';

What is the value of b after this? (Run through the code by hand, if you're not sure.)

Obviously, this doesn't work. The solution is to add a flag, bool uncopyable; to StringRep, which is initialized to false, and to modify the following functions:

String::String( String const& other )
{
    if ( other.myRep->uncopyable ) {
        myRep = new StringRep( other.myRep->data, other.myRep->size );
    } else {
        myRep = other.myRep;
        ++ myRep->useCount;
    }
}

char& String::operator[]( size_t index )
{
    if ( myRep->useCount > 1 ) {
        -- myRep->useCount;
        myRep = new StringRep( myRep->data, myRep->size );
    }
    myRep->uncopyable = true;
    return myRep->data[index];
}

This means, of course, that [] will invalidate iterators and references, but only the first time it is called on an object. The next time, the useCount will be one (and the image will be uncopyable). So a[i] == a[j] works; regardless of which the compiler actually evaluates first (a[i] or a[j]), the second one will find a useCount of 1, and will not have to duplicate. And because of the uncopyable flag,

String a( "some text" );
char& c = a[4];
String b( a );
c = '-';

will also work, and not modify b.

Of course, the above is enormously simplified. Getting it to work in a multithreaded environment is extremely difficult, unless you simply grab a mutex for the entire function for any function which might modify anything (in which case, the resulting class is extremely slow). G++ tried, and failed—there is on particular use case where it breaks. (Getting it to handle the other issues I've ignored is not particularly difficult, but does represent a lot of lines of code.)

edited Mar 03 '14 at 18:50

answered Mar 03 '14 at 13:52

James Kanze

150,581
18
184
329

1

Thank you very much for your answer. Still a bit confused. By saying, `the first call to [], at, etc. (but only the first call) **after something which established new iterators** (and thus invalidated old ones) could invalidate iterators, but only the first.`, what is the `something which established new iterators` in your example before `a[2]` is called? The construction of `a`? Possible call to `data()` when passing as the argument for `b`'s construction? – goodbyeera Mar 03 '14 at 14:19
3

Thanks for the historical perspective on this. – Matthieu M. Mar 03 '14 at 14:28
In this article: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2668.htm, it says:"Iterators and references may be invalidated by: 1.'mutating' operations and; 2.the first 'unsharing' operation following a 'mutating' operation." Here the `unsharing` operation means operations such operator[], so the article is saying the same thing as you do. But I still don't know what the `mutating` part is in your example that cause the subsequent `unsharing` `a[2]` to invalidate the iterators. Could you be more specific to that? Thanks again. – goodbyeera Mar 03 '14 at 16:21
@goodbyeera I've edited my posting to add some simple example code, which should make things clearer. – James Kanze Mar 03 '14 at 18:50
"Getting it to work in a multithreaded environment is extremely difficult," -- Could you elaborate on this? Delphi does exactly this. It has a `UniqueString` procedure that first performs a check to compare the reference count to 1. If that succeeds, then it's obvious that no other thread can hold a reference to the string. If that fails, it allocates a new region of memory, copies the original string to that new region, uses an atomic decrement instruction to lower the reference count, and if the new reference count is zero, frees the memory. Why wouldn't this work in C++? – Mar 03 '14 at 20:05
@JamesKanze: Thank you very much for adding examples for better illustration. I understand why the first call to operator[] may invalidate iterators, and I understand in your example `a[2]` needs to invalidate iterators in order to make COW implementation work correctly. My confusion is about its correspondence to the standard's requirement. (to be cont.) – goodbyeera Mar 04 '14 at 01:33
@JamesKanze: (cont.) The standard says the first call to operator[] which may invalidate iterators should be `Subsequent to any of the above uses ...`. In your example, the only two operation prior to `a[2]` is the construction of `a`, and passing `a` as the argument to `b`'s construction (through a reference to const argument). How could this be reflected to the 4 bulleted items in the C++03 standard? – goodbyeera Mar 04 '14 at 01:34
@hvd "[...] first performs a check to compare the reference count to 1. If that succeeds, then it's obvious that no other thread can hold a reference to the string." That is false. A global `std::string` can be accessed by several different threads at the same time. Thus, thread 1 calls `uniqueString`, and finds reference count 1, thread 2 makes a copy of the string, bumping the reference count to 2, thread 1 then goes on, having determined that no copy was necessary, to "isolate" the string---in the g++ implementation, this means setting the `useCount` to -1. – James Kanze Mar 04 '14 at 09:04
@hvd When the copy made by thread 2 goes out of scope, it sees that the implementation is isolated, and deletes it. Leaving the global instance pointing to a deleted copy. – James Kanze Mar 04 '14 at 09:05
@goodbyeera The list of functions in the bullet 4 corresponds to the functions which might have to reallocate anyway. Reallocation means that any previous references, pointers and iterators are invalid, so the new instance of the implementation can once again be shared. Until an internal reference leaks from it. – James Kanze Mar 04 '14 at 09:07
@JamesKanze That's a good point, but only applies if two threads access the same variable without any sort of locking, and one of them makes a change. That already isn't required to work reliably for almost any type, and in Delphi's implementation wouldn't cause any internal data corruption. Thread 1 would see the reference count is 1, and not do further checks. Thread 2 would bump the refcount. Thread 1 would continue on, thinking it's the only one to modify the referenced string, leaving the reference count as is. There's no magic -1 value there. – Mar 04 '14 at 09:18
@JamesKanze So when the variable in thread 2 goes out of scope, the refcount is decremented, it becomes 1 again, nothing is freed, and there are no dangling pointers anywhere. (In reality, there actually is a magic value too, but it means something else.) – Mar 04 '14 at 09:19
@hvd No. It does require two threads to access the same variable, but it doesn't require either of them to modify it. In other words, it doesn't meet the Posix requirements for thread safety. And I don't know about the Delphi implementation, but in C++, you clearly need something to indicate that some of the internals have leaked, allowing a mutation later. – James Kanze Mar 04 '14 at 14:16
@hvd The issue is something like: `std::string x( "..." ); char& c = x[i]; /*...*/ c = 'x';`. What happens if, during the `/*...*/`, someone makes a copy of `x`? The expression `c = 'x'` is not allowed to modify the copy. So there must be some way of inhibiting the shared implementation. – James Kanze Mar 04 '14 at 14:18
1

@JamesKanze Thank you, that is exactly the sort of example I was looking for, and also what doesn't behave as one might expect in Delphi. `var S1: string; procedure X(var C: Char); var S2: string; begin S2 := S1; C := 'x'; ShowMessage(S2); end; S1 := 'abc'; X(S1[2]);` shows `axc`. (I trust that the Delphi syntax is readable enough to be understandable.) – Mar 04 '14 at 14:45
@hvd *"doesn't behave as one might expect in Delphi"* -- Well, that arguably depends on how much one actually expects from the Borland -> Embarcadero nightmare that is the VCL. Note that `String` has a similarly broken implementation through the C++ bindings as well. It's even more evil when you set a `String` property of an object then accidentally modify it via an old reference later. – Jason C Mar 11 '14 at 16:53
1

Exposing the backing store of a modifiable string as a pointer (modifiable or not) requires that a non-sharable backing store for a string exist prior to such exposure. Would it not be legitimate to defer some of the creation costs associated with a string below 500 trillion characters until such time as the backing store needs to be exposed? Since many strings never need to have their backing store exposed, that would seem like a potential performance "win" in some cases. Creating strings bigger than 500 trillion characters would be slower than 499 trillion, but... – supercat Nov 12 '15 at 21:54
1

...the time to expose any string would be O(1), and the cost to create a string of length N and expose it M times for smaller values would be O(N)+O(M). – supercat Nov 12 '15 at 21:55
Just out of interest, why C++11 explicitly forbids invalidation of iterator by [] ? – Deqing Jul 06 '16 at 02:18

Possibility of COW std::string implementation in C++11

1 Answers1

EDIT: