Can a C implementation use length-prefixed-strings "under the hood"?

Question

After reading this question: What are the problems of a zero-terminated string that length-prefixed strings overcome? I started to wonder, what exactly is stopping a C implementation from allocating a few extra bytes for any char or wchar_t array allocated on the stack or heap and using them as a "string prefix" to store the number N of its elements?

Then, if the N-th character is '\0', N - 1 would signify the string length.

I believe this could mightily boost performance of functions such as strlen or strcat.

This could potentially turn to extra memory consumption if a program uses non-0-terminated char arrays extensively, but that could be remedied by a compiler flag turning on or off the regular "count-until-you-reach-'\0'" routine for the compiled code.

What are possible obstacles for such an implementation? Does the C Standard allow for this? What problems can this technique cause that I haven't accounted for?

And... has this actually ever been done?

There are no strings in C. Just arrays of chars. That pretty much stops you from doing that kind of optimization. — RedX, May 26 '15 at 15:53
@EvdzhanMustafa so where does that leave the "end of string marker"? — harold, May 26 '15 at 15:53
If you decide this implementation is better than standard one for your application - just implement it. No obstacles here. — Eugene Sh., May 26 '15 at 15:54
@RedX: there are. That's what `'\0'`-terminated `char` arrays are called. They even have about 5 SO tags dedicated to them =) — , May 26 '15 at 15:54
`struct StringWithExtraData { size_t curlen; size_t totalbytes; char *data };` — pmg, May 26 '15 at 15:57
The Most Expensive One-byte Mistake http://queue.acm.org/detail.cfm?id=2010365 — Alvein, May 26 '15 at 15:58
I personally believe the problem is that people today still stick with C, instead of moving to C++ or any other suitable higher level language. Why would you like to implement something similar to std::string in C, when there is already std::string in C++? — Flovdis, May 26 '15 at 16:22
@Mints97 A char array terminated by `\0` is just a convention. If you want to use `str*` functions your char array must follow that convention. But nobody is stopping you from creating your own set of functions that follows a different paradigm. — RedX, May 26 '15 at 16:31

score 5 · Accepted Answer · answered May 26 '15 at 15:54

You can store the length of the allocation. And malloc implementations really do do that (or some do, at least).

You can't reasonably store the length of whatever string is stored in the allocation, though, because the user can change the contents to their whim; it would be unreasonable to keep the length up to date. Furthermore, users might start strings somewhere in the middle of the character array, or might not even be using the array to hold a string!

score 4 · Answer 2 · answered May 26 '15 at 16:24

Then, if the N-th character is '\0', N - 1 would signify the string length.

Actually, no, and that's why this suggestion cannot work.

If I overwrite a character in a string with a 0, I have effectively truncated the string, and a subsequent call of strlen on the string must return the truncated length. (This is commonly done by application programs, including every scanner generated by (f)lex, as well as the strtok standard library function. Amongst others.)

Moreover, it is entirely legal to call strlen on an interior byte of the string.

For example (just for demonstration purposes, although I'll bet you can find code almost identical to this in common use.)

/* Split a string like 'key=value...' into key and value parts, and
 * return the value, and optionally its length (if the second argument
 * is not a NULL pointer). 
 * On success, returns the value part and modifieds the original string
 * so that it is the key.
 * If there is no '=' in the supplied string, neither it nor the value
 * pointed to by plen are modified, and NULL is returned.
 */
char* keyval_split(char* keyval, int* plen) {
  char* delim = strchr(keyval, '=');
  if (delim) {
    if (plen) *plen = strlen(delim + 1)
    *delim = 0;
    return delim + 1;
  } else {
    return NULL;
  }
}

"If I overwrite a character in a string with a 0, I have effectively truncated the string"... geez, you're right! I am an idiot. Well, down the drain goes my "brilliant" idea XD Thanks! — , May 26 '15 at 18:03

score 3 · Answer 3 · answered May 26 '15 at 16:09

There's nothing fundamentally stopping you from doing this in your application, if that was useful (one of the comments noted this). There are two problems that would emerge, however:

You'd have to reimplement all the string-handling functions, and have my_strlen, my_strcpy, and so on, and add string-creating functions. That might be annoying, but it's a bounded problem.
You'd have to stop people, when writing for the system, deliberately or automatically treating the associated character arrays as ‘ordinary’ C strings, and using the usual functions on them. You might have to make sure that such usages broke promptly.

This means that it would, I think, be infeasible to smuggle a reimplemented ‘C string’ into an existing program.

Something like

typedef struct {
    size_t len;
    char* buf;
} String;
size_t my_strlen(String*);
...

might work, since type-checking would frustrate (2) (unless someone decided to hack things ‘for efficiency’, in which case there's not much you can do).

Of course, you wouldn't do this unless and until you'd proven that string management was the bottleneck in your code and that this approach provably improved things....

score 1 · Answer 4 · answered May 26 '15 at 16:29

There are a couple of issues with this approach. First of all, you wouldn't be able to create arbitrarily long strings. If you only reserve 1 byte for length, then your string can only go up to 255 characters. You can certainly use more bytes to store the length, but how many? 2? 4?

What if you try to concatenate two strings that are both at the edge of their size limits (i.e., if you use 1 byte for length and try to concatenate two 250-character strings to each other, what happens)? Do you simply add more bytes to the length as necessary?

Secondly, where do you store this metadata? It somehow has to be associated with the string. This is similar to the problem Dennis Ritchie ran into when he was implementing arrays in C. Originally, array objects stored an explicit pointer to the first element of the array, but as he added struct types to the language, he realized that he didn't want that metadata cluttering up the representation of the struct object in memory, so he got rid of it and introduced the rule that array expressions get converted to pointer expressions in most circumstances.

You could create a new aggregate type like

struct string
{
  char *data;
  size_t len;
};

but then you wouldn't be able to use the C string library to manipulate objects of that type; an implementation would still have to support the existing interface.

You could store the length in the leading byte or bytes of the string, but how many do you reserve? You could use a variable number of bytes to store the length, but now you need a way to distinguish length bytes from content bytes, and you can't read the first character by simply dereferencing the pointer. Functions like strcat would have to know how to step around the length bytes, how to adjust the contents if the number of length bytes changes, etc.

The 0-terminated approach has its disadvantages, but it's also a helluva lot easier to implement and makes manipulating strings a lot easier.

I would favor having a variable-length prefix, a function which, given a pointer to the start of the prefix, would produce a structure that included a special header, a pointer to the text, the size of the allocation, and the length of the text within the allocation, and a function which given a pointer to the start of the prefix and a new length, would update the length of the stored string appropriately. A function like `strcat` would pass each argument to the "get string info" function, determine how much of the source string would fit, copy the data, compute the new length, and... — supercat, May 26 '15 at 21:05
...call the function to update the length stored in the destination string. If the aforementioned string-info structure's header was distinct from any other prefix, functions that would expect string pointers could just as happily accept pointers to the aforementioned structures. If code has a long string and wishes to e.g. concatenate a portion of it onto another, it could build a structure identifying the portion and pass that structure as the "source string" argument to the same string-concatenation function as would be used to concatenate an entire string. — supercat, May 26 '15 at 21:08

score 1 · Answer 5 · answered May 26 '15 at 16:42

The string methods in the standard library have defined semantics. If one generates an array of char that contains various values, and passes a pointer to the array or a portion thereof, the methods whose behavior is defined in terms of NUL bytes must search for NUL bytes in the same fashion as defined by the standard.

One could define one's own methods for string handling which use a better form of string storage, and simply pretend that the standard library string-related functions don't exist unless one must pass strings to things like fopen. The biggest difficulty with such an approach is that unless one uses non-portable compiler features it would not be possible to use in-line string literals. Instead of saying:

ns_output(my_file, "This is a test"); // ns -- new string

one would have to say something more like:

MAKE_NEW_STRING(this_is_a_test, "This is a test");
ns_output(my_file, this_is_a_test);

where the macro MAKE_NEW_STRING would create a union of an anonymous type, define an instance called this_is_a_test, and suitably initialize it. Since a lot of strings would be of different anonymous types, type-checking would require that strings be unions that include a member of a known array type, and code expecting strings should be given a pointer that member, likely using something like:

#define ns_output(f,s) (ns_output_func((f),(s).stringref))

It would be possible to define the types in such a way as to avoid the need for the stringref member and have code just accept void*, but the stringref member would essentially perform static duck-typing (only things with a stringref member could be given to such a macro) and could also allow type-checking on the type of stringref itself).

If one could accept those constraints, I think one could probably write code that was more efficient in almost every way that zero-terminated strings; the question would be whether the advantages would be worth the hassle.

Can a C implementation use length-prefixed-strings "under the hood"?

5 Answers5