36

I'm writing a language interpreter in C, and my string type contains a length attribute, like so:

struct String
{
    char* characters;
    size_t length;
};

Because of this, I have to spend a lot of time in my interpreter handling this kind of string manually since C doesn't include built-in support for it. I've considered switching to simple null-terminated strings just to comply with the underlying C, but there seem to be a lot of reasons not to:

Bounds-checking is built-in if you use "length" instead of looking for a null.

You have to traverse the entire string to find its length.

You have to do extra stuff to handle a null character in the middle of a null-terminated string.

Null-terminated strings deal poorly with Unicode.

Non-null-terminated strings can intern more, i.e. the characters for "Hello, world" and "Hello" can be stored in the same place, just with different lengths. This can't be done with null-terminated strings.

String slice (note: strings are immutable in my language). Obviously the second is slower (and more error-prone: think about adding error-checking of begin and end to both functions).

struct String slice(struct String in, size_t begin, size_t end)
{
    struct String out;
    out.characters = in.characters + begin;
    out.length = end - begin;

    return out;
}

char* slice(char* in, size_t begin, size_t end)
{
    char* out = malloc(end - begin + 1);

    for(int i = 0; i < end - begin; i++)
        out[i] = in[i + begin];

    out[end - begin] = '\0';

    return out;
}

After all this, my thinking is no longer about whether I should use null-terminated strings: I'm thinking about why C uses them!

So my question is: are there any benefits to null-termination that I'm missing?

Pete Kirkham
  • 48,893
  • 5
  • 92
  • 171
Imagist
  • 18,086
  • 12
  • 58
  • 77
  • Since malloc() is so expensive in C, I suggest to use this structure: struct String { size_t length; char[1] characters; } Just allocate strlen(s)+1+sizeof(size_t) or strlen(s)+sizeof(String) bytes and copy the string to the address &characters. – Aaron Digulla Aug 10 '09 at 08:07
  • It's simple. That's the benefit. – Mike Dunlavey Oct 23 '16 at 13:46

10 Answers10

33

From Joel's Back to Basics:

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

Community
  • 1
  • 1
weiqure
  • 3,247
  • 2
  • 27
  • 31
  • 20
    Denis Ritchie opinion is somewhat different. BCPL had a length+content representation, with the length contained in one byte. B switched to terminated string "partially to avoid the limitation on the length of a string caused by holding the count in an 8 or 9 bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator." (The Development of the C Language, http://cm.bell-labs.com/cm/cs/who/dmr/chist.pdf) – AProgrammer Aug 10 '09 at 06:40
19

The usual solution is to do both - keep the length and maintain the null terminator. It's not much extra work and means that you are always ready to pass the string to any function.

Null-terminated strings are often a drain on performance, for the obvious reason that the time taken to discover the length depends on the length. On the plus side, they are the standard way of representing strings in C, so you have little choice but to support them if you want to use most C libraries.

Daniel Earwicker
  • 114,894
  • 38
  • 205
  • 284
  • 1
    This is what Lua does. It makes interfacing to C for normal use cases very clean, and still supports arbitrary length binary buffers. – RBerteig Aug 10 '09 at 06:36
  • 3
    It's what most things do! You don't even have to maintain the null terminator all the time - just do `str[len] = '\0'` whenever you need to. This is what `std::string::c_str`` usually does in C++. – Daniel Earwicker Aug 10 '09 at 06:40
  • By most things, I mean most string classes, and most interpreter string representations. One widely-used example on Windows is the BSTR type. – Daniel Earwicker Aug 10 '09 at 06:41
  • This is exactly why I asked this question; I thought I might be missing some solution. It seems obvious now but I didn't think of this! – Imagist Aug 10 '09 at 07:29
  • Cool! Well, you see that green check icon next to my answer...? – Daniel Earwicker Aug 10 '09 at 07:34
  • @Earwicker I was planning on checking that, I just like to wait a few days. There are few things more annoying to me than people who accept an answer within minutes of asking the question. – Imagist Aug 11 '09 at 12:31
  • 1
    :) It means more if you have to wait for something, I think! – Daniel Earwicker Aug 11 '09 at 14:32
9

One advantage of nul-terminated strings is that if you are walking through a string character-by-character, you only need to keep a single pointer to address the string:

while (*s)
{
    *s = toupper(*s);
    s++;
}

whereas for strings without sentinels, you need to keep two bits of state around: either a pointer and index:

while (i < s.length)
{
    s.data[i] = toupper(s.data[i]);
    i++;
}

...or a current pointer and a limit:

s_end = s + length;
while (s < s_end)
{
    *s = toupper(*s);
    s++;
}

When CPU registers were a scarce resource (and compilers were worse at allocating them), this was important. Now, not so much.

caf
  • 233,326
  • 40
  • 323
  • 462
  • 4
    "When CPU registers were a scarce resource" - registers are still a scarce resource on x86 and x64. – Jimmy Aug 10 '09 at 06:49
  • I don't get it; if I'm storing string in the `struct` example I gave, why can't I use that as the limit? – Imagist Aug 10 '09 at 07:56
  • 1
    The point is that during a processing loop like the above, length-based strings like yours end up using two registers for string book-keeping whereas sentinel-based strings like idiomatic C strings only use one (the other one is obtained "for free" because the character values are being loaded in order to process them anyway). – caf Aug 10 '09 at 09:03
8

Lengths have their problems too.

  • The length takes extra storage (not such an issue now, but a big factor 30 years ago).

  • Every time you alter a string you have to update the length, so you get reduced performance across the board.

  • With a NUL-terminated string you can still use a length or store a pointer to the last character, so if you are doing lots of string manipulations, you can still equal the performance of string-with-length.

  • NUL-terminated strings are much simpler - The NUL terminator is just a convention used by methods like strcat to determine the end of the string. So you can store them in a regular char array rather than having to use a struct.

Jason Williams
  • 56,972
  • 11
  • 108
  • 137
  • 1
    Extra storage can still be a big issue for embedded systems, which is one reason to stress keeping the language light-weight. – Jimmy Aug 10 '09 at 06:45
  • 4
    @Jimmy My issue there is: on such embedded systems, why are you using strings? I don't think I even ever used a `char` back when I was doing robotics programming. The only example I can think of is if you're programming for an LED display (like those scrolling text things or the ones on soft drink machines) but the functionality there is so simple that I still have a difficult time imagining the extra 3 bytes being a problem (4 byte int - 1 byte since you don't have to store the null character). – Imagist Aug 10 '09 at 07:34
  • What you suggest is not that trivial btw. Who will take ownership of the buffer? Will the newly created temporary string do that? I doubt you want this and then you need a way to have such non-owning strings to avoid copying. – sharptooth Aug 10 '09 at 08:17
7

One benefit is that with null-termination any tail of a null-terminated string is also a null-terminated string. If you need to pass a substring starting with Nth character (provided there's no buffer overrun) into some string-handling function - no problem, just pass the offseeted address there. When storing size in some other way you would need to construct a new string.

sharptooth
  • 167,383
  • 100
  • 513
  • 979
  • Can you give an example of a string of that you might want to print the end of the string? – weiqure Aug 10 '09 at 06:12
  • That can be used when concatenating strings - you could want to append not the entire string, but only the tail of it. Then you call strcat( target, source + offset); - and it's done. – sharptooth Aug 10 '09 at 06:16
  • Take a front trim of white space. You can determine the first non-white space char and instead of actually changing the string, you can just pass the starting offset in, saving you either allocating new memory, or copying data. – Dan McGrath Aug 10 '09 at 06:17
  • 3
    That's not all that different for what I do with my struct: `struct String new; new.characters = old.characters + offset; new.length = old.length - offset;` It's a bit of bookkeeping but comes out to what, 5 instructions? This seems trivial compared to the difference if you needed to do something to the beginning of the string instead of the end. – Imagist Aug 10 '09 at 08:02
  • It makes it really easy to do things like recursive string matching, spelling correction, etc., if you can treat the string like you would a list in Lisp. – Mike Dunlavey Aug 13 '09 at 00:57
6

Slightly offtopic, but there's a more efficient way to do length-prefixed strings than the way you describe. Create a struct like this (valid in C99 and up):

struct String 
{
  size_t length;
  char characters[0];
}

This creates a struct that has the length at the start, with the 'characters' element usable as a char* just as you would with your current struct. The difference, however, is that you can allocate only a single item on the heap for each string, instead of two. Allocate your strings like this:

mystr = malloc(sizeof(String) + strlen(cstring))

Eg - the length of the struct (which is just the size_t) plus enough space to put the actual string after it.

If you don't want to use C99, you can also do this with "char characters[1]" and subtract 1 from the length of the string to allocate.

Nick Johnson
  • 100,655
  • 16
  • 128
  • 198
4

Just throwing out some hypotheticals:

  • there's no way to get a "wrong" implementation of null terminated strings. A standardized struct however could have vendor-specific implementations.
  • no structs are required. Null terminated strings are "built-in" so to speak, by virtue of being a special case of a char*.
Jimmy
  • 27,142
  • 5
  • 87
  • 100
2

Although I prefer the array + len method in most cases, there are valid reasons for using null-terminated.

Take a 32-bit system.

To store a 7 byte string
char * + size_t + 8 bytes = 19 bytes

To store a 7 byte null-term string
char * + 8 = 16 bytes.

null-term arrays don't need to be immutable like your strings do. I can happily truncate the c-string by simply places a null char. If you code, you would need to create a new string, which involves allocating memory.

Depending on the usage of the strings, your strings will never be able to match the performance possible with c-strings as opposed to your strings.

Dan McGrath
  • 41,220
  • 11
  • 99
  • 130
  • 2
    You can truncate a string-with-length by just reducing the length. Usually this means you have two lengths - the current length of the string plus the amount of memory you have currently allocated for the string. This allows it to be dynamically resized without having to realloc on every modification. – Jason Williams Aug 10 '09 at 06:17
  • 3
    Indeed, that is the way to go; I had based my answer on the op's string struct, which would enable you to reduce the length, but never to utilise that space again. Ropes are another interesting way to handle strings. http://en.wikipedia.org/wiki/Rope_(computer_science) – Dan McGrath Aug 10 '09 at 06:52
  • A few questions: assuming a byte is 8 bits, shouldn't a 32-bit system have `sizeof(size_t) == 4` and `sizeof(char*) == 4`? And with my method you don't have to use 8 characters for the first method. So I'm getting: 4 + 4 + 7 = 15 for my method, and 4 + 8 = 12 for the null-terminating method. I'm not contesting your point, just the math that leads to your point. – Imagist Aug 10 '09 at 07:50
  • @Dan McG (comment) My strings will be stored in a garbage-collected system, which will allow me to reclaim the space. And my interpreter uses ropes internally; the garbage collector will use idle cycles to flatten ropes into strings. – Imagist Aug 10 '09 at 07:54
2

You're absolutely right that 0-termination is a method which is poor with respect to type checking and performance for part of the operations. The answers on this page already summarize the origins and uses for it.

I liked the way Delphi stored strings. I believe it maintains a length/maxlength in before the (variable length) string. This way the strings can be null-terminated for compatibility.

My concerns with your mechanism: - additional pointer - immutability si in the core parts of your language; normally string types are not immutable so if you ever reconsider than it'll be tough. You'd need to implement a 'create copy on change' mechanism - use of malloc (hardly efficient, but may be included here just for ease?)

Good luck; writing your own interpreter can be very educational in understanding mainly the grammar and syntax of programming languages! (at least, it ws for me)

Adriaan
  • 3,282
  • 1
  • 23
  • 31
0

I think main reason is that standard says nothing concrete about size of any type other than char. But sizeof(char) = 1 and that is definitely not enough for string size.

Kirill V. Lyadvinsky
  • 97,037
  • 24
  • 136
  • 212
  • It's enough for 2^CHAR_BIT characters per string. – Daniel Earwicker Aug 10 '09 at 06:17
  • It is only 255 characters. Too little. – Kirill V. Lyadvinsky Aug 10 '09 at 06:31
  • No, it's enough for 2^CHAR_BIT - 1 characters per string, and if you've never had to deal with strings longer than 255 characters then you've only had to program for a very limited problem domain. However C *does* say concrete things about other integral types - for example int has to have a range of at least -32767 to +32767. In particular C says that size_t must be able to hold the size of any object, so that would be fine as a standardised string length. – caf Aug 10 '09 at 06:35
  • @caf, you aren't thinking creatively enough if you can't see how to represent that extra unit of character length. :) – Daniel Earwicker Aug 10 '09 at 07:32
  • @Jla3ep Did you look at my code sample? I used size_t to store the length for a reason. – Imagist Aug 10 '09 at 07:37
  • @caf I think Earwicker means that you can have the `characters` pointer point to `NULL`. – Imagist Aug 10 '09 at 07:39
  • ...for a zero-length string (I hate that you can't edit comments!). – Imagist Aug 10 '09 at 07:39
  • @Imagist, you asked why C uses null-terminate strings. I believe that size of `size_t` was equal to 1 in the beginning. – Kirill V. Lyadvinsky Aug 10 '09 at 08:04