Would this optimization in the implementation of std::string be allowed?

Question

I was just thinking about the implementation of std::string::substr. It returns a new std::string object, which seems a bit wasteful to me. Why not return an object that refers to the contents of the original string and can be implicitly assigned to a std::string? A kind of lazy evaluation of the actual copying. Such a class could look something like this:

template <class Ch, class Tr, class A>
class string_ref {
public:
    // not important yet, but *looks* like basic_string's for the most part

private:
    const basic_string<Ch, Tr, A> &s_;
    const size_type pos_;
    const size_type len_;    
};

The public interface of this class would mimic all of the read-only operations of a real std::string, so the usage would be seamless. std::string could then have a new constructor which takes a string_ref so the user would never be the wiser. The moment you try to "store" the result, you end up creating a copy, so no real issues with the reference pointing to data and then having it modified behind its back.

The idea being that code like this:

std::string s1 = "hello world";
std::string s2 = "world";
if(s1.substr(6) == s2) {
    std::cout << "match!" << std::endl;
}

would have no more than 2 std::string objects constructed in total. This seems like a useful optimization for code which that performs a lot of string manipulations. Of course, this doesn't just apply to std::string, but to any type which can return a subset of its contents.

As far as I know, no implementations do this.

I suppose the core of the question is:

Given a class that can be implicitly converted to a std::string as needed, would it be conforming to the standard for a library writer to change the prototype of a member's to return type? Or more generally, do the library writers have the leeway to return "proxy objects" instead of regular objects in these types of cases as an optimization?

My gut is that this is not allowed and that the prototypes must match exactly. Given that you cannot overload on return type alone, that would leave no room for library writers to take advantage of these types of situations. Like I said, I think the answer is no, but I figured I'd ask :-).

@Evan: you might want to read The C++ Programming Language - it presents an implementation of this concept. Still, such a string_ref is easily invalidating, and copying's often relatively cheap, so generally people are happy to copy into a new string. When I implement such a class, I tend to do it in terms of a const char* and size, which makes it a bit more widely applicable (e.g. overlaying it on memory mapped files, const char[] data etc). If you need the substring to be changeable independent of the source, then things quickly get complicated, particularly in threaded environments.... — Tony Delroy, Jan 13 '11 at 16:48
You don't think COW implementations of std::string don't already re-use the buffer of the original string when you do substr(). — Martin York, Jan 13 '11 at 18:52
@Matthieu: yes, very like llvm::StringRef - I'm sure every experienced performance-conscious C++ programmer's thrown a version or three together... — Tony Delroy, Jan 14 '11 at 01:24
@Martin York: I'm not aware of any COW implementations that worked for substrings of the original string. That doesn't mean they aren't out there, just that I haven't seem that. — Evan Teran, Jan 14 '11 at 04:02
@Tony: Thanks, I actually have a copy, but don't recall that section, I'll have to check that out. — Evan Teran, Jan 14 '11 at 04:03
@Evan: actually, COW is nowadays regarded as a bad idea because of multi-threading. The overhead associated with synchronizing the accesses to the shared buffer are annoying. Also it's not as charming as it looks like: a number of methods grant access to the internals (`operator[]`, `at` in their non-const version for example), which actually require a copy ahead of time, even if it won't be useful. — Matthieu M., Jan 14 '11 at 07:33

GManNickG · Accepted Answer · 2011-01-13T17:28:42.630

6

This idea is copy-on-write, but instead of COW'ing the entire buffer, you keep track of which subset of the buffer is the "real" string. (COW, in its normal form, was (is?) used in some library implementations.)

So you don't need a proxy object or change of interface at all because these details can be made completely internal. Conceptually, you need to keep track of four things: a source buffer, a reference count for the buffer, and the start and end of the string within this buffer.

Anytime an operation modifies the buffer at all, it creates its own copy (from the start and end delimiters), decreases the old buffer's reference count by one, and sets the new buffer's reference count to one. The rest of the reference counting rules are the same: copy and increase count by one, destruct a string and decrease count by one, reach zero and delete, etc.

substr just makes a new string instance, except with the start and end delimiters explicitly specified.

edited Jan 13 '11 at 17:28

answered Jan 13 '11 at 16:49

GManNickG

494,350
52
494
543

1

Copy-on-write or a subblock of the whole memory? I wonder if any library actually supported that (as compared to copy-on-write of the whole string) – David Rodríguez - dribeas Jan 13 '11 at 16:53
@David: It's COW, but tracking which subset of the memory to actually copy. And good question, I don't think I've ever heard of one. The overhead of COW itself isn't generally worth it these days. – GManNickG Jan 13 '11 at 16:55
@GMan: it would require some hell of a machinery to keep track of all the intervals (take a peek at IntervalTree on Wikipedia), so I think it would be easier (and actually faster) to actually COW the whole thing... unless std::string was allowed to have a non-contiguous layout (like the STL `rope` class), in which case you could maintain the count by block I guess. – Matthieu M. Jan 13 '11 at 17:47
@Matt: Where does more than one interval come in? I'm afraid there's a problem looming over me I don't see. – GManNickG Jan 13 '11 at 17:53
@GMan: when the same base string issue several substrings over different parts of its buffer. Example: with `std::string str = "abcdef";` we do `ref sub1 = str.substr(0,4); // == "abcd"` and `ref sub2 = str.substr(2,4); // == "cdef"`. At this point you have overlapping intervals being referenced from two distinct objects. – Matthieu M. Jan 14 '11 at 07:15
@Matt: So then each is referring to the same buffer, with a ref. count of three. `sub1` has start and end equal to buf and buf + 4, and `sub2` has start and end at buf + 2 and buf + 4. A modification of buf by any of these makes their own copy with the range start-end. Doesn't seem *too* complicated. – GManNickG Jan 14 '11 at 16:23
@GMan: no, it's not complicated, but I'd though you'd want to track each interval separately so has to be able not to copy the strings when changes occur on an interval that you're the only one to have reference to. – Matthieu M. Jan 14 '11 at 16:34
@Matt: I think I almost see where you're coming from but not quite. :) If a string made a mutation and when it decreased the ref count it was zero, then it can just re-use that buffer and increase it back to one, though now it has extra memory lying around. – GManNickG Jan 14 '11 at 16:44
@GMan: not totally, I mean that in the example I referenced above, if `sub1` and `sub2` happened to be on disjoints intervals, and `str` dropped out of scope, then one could hope that `sub1` (now the only one to be able to access the "shared" block of memory between its indices) could modify its "slice" of the memory in place, even though `sub2` refers to the same block, but a different slice. – Matthieu M. Jan 14 '11 at 17:41
@Matt: Oh, got it. That sure is complex. :) – GManNickG Jan 14 '11 at 17:44

score 3 · Answer 2 · answered Jan 13 '11 at 17:01

This is a quite well-known optimization that is relatively widely used, called copy-on-write or COW. The basic thing is not even to do with substrings, but with something as simple as

s1 = s2;

Now, the problem with this optimization is that for C++ libraries that are supposed to be used on targets supporting multiple threads, the reference count for the string has to be accessed using atomic operations (or worse, protected with a mutex in case the target platform doesn't supply atomic operations). This is expensive enough that in most cases the simple non-COW string implementation is faster.

See GOTW #43-45:

http://www.gotw.ca/gotw/043.htm

http://www.gotw.ca/gotw/044.htm

http://www.gotw.ca/gotw/045.htm

To make matters worse, libraries that have used COW, such as the GNU C++ library, cannot simply revert to the simple implementation since that would break the ABI. (Although, C++0x to the rescue, as that will require an ABI bump anyway! :) )

score 1 · Answer 3 · answered Jan 13 '11 at 16:49

Since substr returns std::string, there is no way to return a proxy object, and they can't just change the return type or overload on it (for the reasons you mentioned).

They could do this by making string itself capable of being a sub of another string. This would mean a memory penalty for all usages (to hold an extra string and two size_types). Also, every operation would need to check to see if it has the characters or is a proxy. Perhaps this could be done with an implementation pointer -- the problem is, now we're making a general purpose class slower for a possible edge case.

If you need this, the best way is to create another class, substring, that constructs from a string, pos, and length, and coverts to string. You can't use it as s1.substr(6), but you can do

 substring sub(s1, 6);

You would also need to create common operations that take a substring and string to avoid the conversion (since that's the whole point).

score 0 · Answer 4 · answered Jan 13 '11 at 16:59

0

Regarding your specific example, this worked for me:

if (&s1[6] == s2) {
    std::cout << "match!" << std::endl;
}

That may not answer your question for a general-purpose solution. For that, you'd need sub-string CoW, as @GMan suggests.

answered Jan 13 '11 at 16:59

chrisaycock

36,470
14
88
125

Daniel Trebbien · Answer 5 · 2011-01-13T17:23:24.053

What you are talking about is (or was) one of the core features of Java's java.lang.String class (http://fishbowl.pastiche.org/2005/04/27/the_string_memory_gotcha/). In many ways, the designs of Java's String class and C++'s basic_string template are similar, so I would imagine that writing an implementation of the basic_string template utilizing this "substring optimization" is possible.

One thing that you will need to consider is how to write the implementation of the c_str() const member. Depending on the location of a string as a substring of another, it may have to create a new copy. It definitely would have to create a new copy of the internal array if the string for which the c_str was requested is not a trailing substring. I think that this necessitates using the mutable keyword on most, if not all, of the data members of the basic_string implementation, greatly complicating the implementation of other const methods because the compiler is no longer able to assist the programmer with const correctness.

EDIT: Actually, to accommodate c_str() const and data() const, you could use a single mutable field of type const charT*. Initially set to NULL, it could be per-instance, initialized to a pointer to a new charT array whenever c_str() const or data() const are called, and deleted in the basic_string destructor if non-NULL.

score 0 · Answer 6 · answered Jan 13 '11 at 17:39

If and only if you really need more performance than std::string provides then go ahead and write something that works the way you need it to. I have worked with variants of strings before.

My own preference is to use non-mutable strings rather than copy-on-write, and to use boost::shared_ptr or equivalent but only when the string is actually beyond 16 in length, so the string class also has a private buffer for short strings.

This does mean that the string class might carry a bit of weight.

I also have in my collection list a "slice" class that can look at a "subset" of a class that lives elsewhere as long as the lifetime of the original object is intact. So in your case I could slice the string to see a substring. Of course it would not be null-terminated, nor is there any way of making it such without copying it. And it is not a string class.

Would this optimization in the implementation of std::string be allowed?

6 Answers6