Just generally speaking, if you showed me a series of reasonably well-written C programs and a series of reasonably well-written C++ programs and asked me which one had more efficient string operations, my bet would be on the C programs.
And this is coming from a C++ enthusiast. But there is a strong tendency in C++ programs for people to use far more memory and/or incur far more heap allocations than necessary when working with strings, and that tends to more than outweigh the extra cache-friendly sequential passes the C program makes with a few extra calls to strlen when it could have kept the string's size around.
As a basic example, a C++ developer wanting to store a large number of strings for random access might do this:
std::vector<std::string> boatload_of_strings;
... and this either incurs far more heap allocations than necessary or uses a boatload more memory than necessary. With most modern standard library implementations using the small string optimization, even storing an entry like "a" or "I" could end up occupying 24 bytes just for that one-character string when you only need 2 bytes. Meanwhile the C programmer, lacking such conveniences, might store this:
// One contiguous buffer for all strings, realloced when
// capacity is exceeded.
char* boatload_of_strings;
// Starting position of each string.
int* string_start;
... with the nth null-terminated string accessed like this:
const char* nth_string = boatload_of_strings + string_start[n];
And that's so much more efficient: more cache-friendly, less memory used, etc. Of course it takes more time to write and is more error-prone, and if the question is about productivity working with strings rather than computational/memory efficiency, I'd quickly change my vote to C++. Of course a C++ developer could also do:
// One contiguous buffer for all strings.
vector<char> boatload_of_strings;
// Starting position of each string.
vector<int> string_start;
... and that's a very efficient way to represent a boatload of strings that need to be accessed randomly. But this subject is about tendencies, and I think most C++ developers are more likely to use std::string here than a vector of char. All we can talk about is tendencies, because a C programmer can also store a string length in a struct. A C programmer could also do something very inefficient and store char** boatload_of_strings and allocate every single teeny string separately. We're just talking about what people tend to do, and given what I've seen people tend to do in these two languages, my bet is on C programs to generally have an efficiency edge with strings.
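Just to make that vector<char> plus vector<int> idea concrete, here is a minimal sketch of how it could be wrapped up; the StringPool name and its member functions are made up purely for illustration:
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal sketch of the flat layout above: all strings live back-to-back
// in one contiguous char buffer, and a second array records where each
// null-terminated string starts.
struct StringPool
{
    std::vector<char> chars;   // every string, each followed by its '\0'
    std::vector<int>  starts;  // starting offset of the nth string

    void add(const char* s)
    {
        starts.push_back(static_cast<int>(chars.size()));
        chars.insert(chars.end(), s, s + std::strlen(s) + 1); // copy the '\0' too
    }

    const char* get(std::size_t n) const
    {
        return chars.data() + starts[n];
    }
};
... which keeps the whole boatload in two growing buffers no matter how many strings you add, instead of one heap block per non-tiny string.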
There is a good point that the lack of C strings keeping track of their length could lead to more linear passes than necessary, but again those are cache-friendly linear passes through a contiguous buffer. It'd be somewhat like arguing that a linked list which allocates every single node separately against a general-purpose allocator is more efficient for push_backs than std::vector because it doesn't have to reallocate and copy a buffer in linear time. Memory efficiency trumps some extra linear passes here and there, and std::string is never going to be perfectly optimal from a memory efficiency standpoint, especially when used to store persistent data. It either uses too much memory for really small strings with the small string optimization, or incurs too many heap allocations for medium-sized strings since the small string optimization favors a really tiny buffer.
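If you want to see that per-object cost on your own standard library, a quick check like this is enough; the exact numbers vary by implementation, but 24-32 bytes per std::string object is typical:
#include <iostream>
#include <string>

int main()
{
    // With the small string optimization, even "a" occupies the full
    // sizeof(std::string) footprint; once a string outgrows the small
    // buffer, you pay a heap allocation on top of that.
    std::string tiny = "a";
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';
    std::cout << "capacity of \"a\": " << tiny.capacity() << '\n';
}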
There might be some genuine cases where C++ does hold an edge in practical use cases, but most of the time I have gotten major performance boosts replacing C++ code using std::string with plain old character buffers, not the other way around, as you have also found. It'd be rare for me to encounter a genuine case, measured and profiled accordingly before and after, where replacing competently-written C code using character buffers results in a performance boost after switching to, say, std::string or std::wstring.
strlen
Another crucial thing to keep in mind is that strlen is often implemented in a very efficient way. For example, MSVC treats it as a compiler intrinsic, along with functions like memset. These aren't treated as regular function calls; the compiler instead generates very efficient instructions for them, far more efficient than if you just hand-rolled a basic loop counting the number of characters in a string until it reaches the null terminator.
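To be clear about what I mean by a hand-rolled loop, it's something like this byte-at-a-time scan, which is the baseline the intrinsic handily beats:
#include <cstddef>

// A plain byte-at-a-time scan. Compilers typically expand strlen itself
// into much faster word-sized or SIMD code instead of a loop like this.
std::size_t naive_strlen(const char* s)
{
    std::size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}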
So not only is strlen a cache-friendly sequential loop through a contiguous buffer, but one that has been optimized to death. I have never, ever seen the use of strlen show up as a hotspot in any profiling session on any codebase whatsoever. I have definitely seen my share of std::string- and QString-related hotspots in VTune.
[...] does not use the free store (dynamic memory) for short argument strings.
I don't know what kind of C programs Bjarne was looking at, but most C programs I see don't use heap allocations for small strings. They often use buffers on the stack like this:
char buf[256];
... which isn't very robust but is definitely not going to invoke heap allocations, or they use C99 VLAs like this:
char buf[n];
... which risks stack overflows but, again, doesn't invoke needless heap allocations, or something like this:
char buf[256];
char* data = (n < 256) ? buf : malloc(n+1);
...
if (data != buf)
    free(data);
... which is the most robust and still avoids the heap allocation in the common cases. Besides that, people have been touting that std::string is faster than your average C code for ages, long before small string optimizations, and even back when most std::string implementations used copy-on-write. And the real-world results never quite matched up to those claims in this one case.
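Fleshed out into a complete function, that last stack-or-heap pattern might look something like this; the function name, the uppercasing work, and the 256-byte threshold are all just for illustration:
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Sketch of the stack-buffer-with-heap-fallback pattern: small inputs
// stay entirely on the stack, only oversized ones pay for a heap
// allocation (error checking omitted to keep the sketch short).
void print_upper(const char* s)
{
    std::size_t n = std::strlen(s);
    char buf[256];
    char* data = (n + 1 <= sizeof buf) ? buf
                                       : static_cast<char*>(std::malloc(n + 1));
    for (std::size_t i = 0; i <= n; ++i)  // copies the '\0' as well
        data[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));
    std::puts(data);
    if (data != buf)
        std::free(data);
}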
Compose Example
Okay so phew, getting to the compose example:
char* compose(const char* name, const char* domain)
{
    char* res = malloc(strlen(name)+strlen(domain)+2); // space for strings, '@', and 0
    char* p = strcpy(res,name);
    p += strlen(name);
    *p = '@';
    strcpy(p+1,domain);
    return res;
}
First of all, I don't often encounter production C code written like this, with a function that heap-allocates a string and returns a pointer for the caller to free. Far more often I see people doing things like:
char buf[512];
sprintf(buf, "%s@%s", name, domain);
... which again isn't the safest code, but it definitely doesn't incur more heap allocations than the other version, and it doesn't need an extra pass to determine the length of these strings since the buffer is already pre-sized. But if we dissect the C++ version:
string compose(const string& name, const string& domain)
{
    return name+'@'+domain;
}
string::operator+ can potentially get away with one less linear pass through these two strings because they store a size, but if these strings are teeny, that's a trivial saving; it's saving pennies while paying a cost elsewhere. If these strings aren't teeny, then the small string optimization doesn't help; it actually hurts and causes more memory waste, and you'd still end up with a heap allocation. The above solution is way more robust than the sprintf solution using a fixed-size buffer, but here I'm just talking about efficiency against common tendencies.
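For what it's worth, the C side can also avoid the repeated length computation by keeping the lengths it already computed; here is a sketch of that variant (the compose_once name is just for this example):
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Variant of the C compose above that computes each length once and
// reuses it, instead of calling strlen(name) a second time. Still a
// single heap allocation for the result, which the caller frees.
char* compose_once(const char* name, const char* domain)
{
    std::size_t name_len   = std::strlen(name);
    std::size_t domain_len = std::strlen(domain);
    char* res = static_cast<char*>(std::malloc(name_len + domain_len + 2));
    std::memcpy(res, name, name_len);
    res[name_len] = '@';
    std::memcpy(res + name_len + 1, domain, domain_len + 1); // include the '\0'
    return res;
}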
Just very generally speaking, doing an additional linear pass through contiguous data to determine a size is often cheaper than alternatives that could require more or bigger heap allocations. For example, if you do this:
int count = 0;
// count how many elements there are:
for (...)
{
    ...
    ++count;
}
// Now size the vector accordingly:
vector<int> values(count);
// Do a second pass through the same data.
for (...)
    ...
... this often tends to be more efficient than:
vector<int> values;
// Do a single pass through the data with push_backs.
for (...)
    ...
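As a concrete (if contrived) version of that comparison, assuming the data is just being filtered out of an existing array:
#include <cstddef>
#include <vector>

// Two-pass version of the idea above: count the matching elements first,
// reserve exactly once, then fill. The filtering condition is arbitrary.
std::vector<int> keep_large_two_pass(const int* data, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 100)
            ++count;

    std::vector<int> values;
    values.reserve(count);  // one allocation, sized exactly
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 100)
            values.push_back(data[i]);
    return values;
}

// Single-pass version: simpler, but push_back may have to reallocate and
// copy the whole buffer several times as it grows.
std::vector<int> keep_large_one_pass(const int* data, std::size_t n)
{
    std::vector<int> values;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] > 100)
            values.push_back(data[i]);
    return values;
}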
And a similar principle applies to strings. More linear passes through a string aren't necessarily more expensive if they result in less memory use, fewer heap allocations, etc.