23

Just wondering why this is the case. I'm eager to know more about low level languages, and I'm only into the basics of C and this is already confusing me.

Do languages like PHP automatically null terminate strings as they are being interpreted and / or parsed?

alex
  • Related or possibly duplicate: http://stackoverflow.com/questions/1253291/why-null-terminated-strings-or-null-terminated-vs-characters-length-storage –  Feb 08 '10 at 12:01
  • Just found this too: http://stackoverflow.com/questions/2037209/what-is-a-null-terminated-string – alex Feb 08 '10 at 12:22
  • There are two common methods for representing text: 1. Specifying the length, followed by the text. 2. Using a terminating character. Many databases use the former. One question to ask yourself is, "How is the end of a string determined?" – Thomas Matthews Feb 08 '10 at 23:02

9 Answers

31

From Joel's excellent article on the topic:

Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:

1. There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
2. Your string can't have any zeros in it. So you can't store an arbitrary binary blob like a JPEG picture in a C string.

Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."

Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague.

Max Shawabkeh
  • great, thx... also what other methods may these be? thank you. – Joe DF Jun 06 '13 at 07:15
  • Is there a standard way in C99 to create a non-ASCIZ string? – Arc676 Oct 19 '15 at 14:17
  • This is apocryphal. I've looked in the PDP-7 manual and I can't find any mention of ASCIZ, null, or even data types at all. The only mentions of ASCII are in the input program data, and zeros appear only in non-string places. There is a TEXT pseudo-instruction, but the user chooses the delimiter. * http://bitsavers.trailing-edge.com/pdf/dec/pdp7/F-75P_PDP7prelimUM_Dec64.pdf * http://www.bitsavers.org/pdf/dec/pdp7/PDP-7_AsmMan.pdf – Pod Jan 20 '22 at 10:58
  • In fact, the example program on the Wikipedia page for the PDP-8 shows a manual implementation of null-terminated strings: https://en.wikipedia.org/wiki/PDP-8 However, I can find references in PDP-11 material. So it looks to me as if ASCIZ was added to the hardware to support Unix and the C programming language? – Pod Jan 20 '22 at 10:59
7

Think about what memory is: a contiguous block of byte-sized units that can be filled with any bit patterns.

2a c6 90 f6

A character is simply one of those bit patterns. Its meaning as a string is determined by how you treat it. If you looked at the same part of memory, but using an integer view (or some other type), you'd get a different value.
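To make the "different views of the same memory" point concrete, here is a minimal sketch in C. The helper name `view_as_u32` is made up for this illustration, and the integer you get depends on the machine's byte order, so no particular value is assumed. In ASCII, the first byte `2a` happens to be the character `*`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterprets four raw bytes as one 32-bit
   unsigned integer, using the machine's native byte order. */
uint32_t view_as_u32(const unsigned char *bytes)
{
    uint32_t value;

    assert(bytes != NULL);
    /* memcpy is the portable way to reinterpret raw bytes. */
    memcpy(&value, bytes, sizeof value);
    return value;
}
```

Passing the four bytes `2a c6 90 f6` gives `0xf690c62a` on a little-endian machine and `0x2ac690f6` on a big-endian one, while reading the same memory one `char` at a time gives four separate characters.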

If you have a variable which is a pointer to the start of a bunch of characters in memory, you must know when that string ends and the next piece of data (or garbage) begins.

Example

Let's look at this string in memory...

H e l l o , w o r l d ! \0 
^
|
+------ Pointer to string

...we can see that the string logically ends after the ! character. If there were no \0 (or any other method to determine its end), how would we know when seeking through memory that we had finished with that string? Other languages carry the string length around with the string type to solve this.
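As a rough sketch of what "seeking through memory" means here, the function below walks forward from the pointer until it hits the \0. The name `count_until_nul` is invented for this example; the standard library's `strlen` does the same job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the standard strlen): counts characters
   by walking memory until the terminating '\0' is found. */
size_t count_until_nul(const char *s)
{
    const char *p = s;

    assert(s != NULL);
    while (*p != '\0') {
        p++;            /* step to the next byte in memory */
    }
    return (size_t)(p - s);
}
```

Without the terminator the loop would simply keep reading into whatever data follows the string, which is exactly the problem described above.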

I asked this question when my underlying knowledge of computers was limited, and this is the answer that would have helped many years ago. I hope it helps someone else too. :)

Community
alex
6

C strings are arrays of chars, and a C array is usually handled through a pointer to its first element, so only the start location of the array is known. The length (or end) of the array must therefore be expressed somehow; in the case of strings, null termination is used. An alternative would be to carry the length of the string alongside the pointer, or to put the length in the first array location, or something else. It's just a matter of convention.

Higher-level languages like Java or PHP store the size information with the array automatically and transparently, so the user needn't worry about it.

Joonas Pulakka
5

C has no notion of strings by itself. Strings are simply arrays of char (or wchar_t for wide characters, such as Unicode text).

Because of this, C has no built-in way to check, e.g., the length of a string: there is no "mystring->length", no length value stored anywhere. The only way to find the end of the string is to iterate over it and check for the \0.

There are string-libraries for C which use structs like

struct string {
    int length;
    char *data;
};

to remove the need for \0 termination, but these are not part of standard C.
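A minimal sketch of how such a length-carrying string could be used is shown below. The names `lstring`, `lstring_init` and `lstring_free` are invented for this example and do not come from any particular library; `size_t` is used for the length instead of `int`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-carrying string; not part of standard C. */
struct lstring {
    size_t length;  /* number of bytes in data */
    char *data;     /* NOT required to end in '\0' */
};

/* Copies n bytes from src. Returns 0 on success, -1 if allocation fails. */
int lstring_init(struct lstring *s, const char *src, size_t n)
{
    assert(s != NULL && (src != NULL || n == 0));
    s->data = malloc(n > 0 ? n : 1);
    if (s->data == NULL) {
        s->length = 0;
        return -1;
    }
    if (n > 0) {
        memcpy(s->data, src, n);
    }
    s->length = n;
    return 0;
}

/* Releases the buffer and resets the struct. */
void lstring_free(struct lstring *s)
{
    assert(s != NULL);
    free(s->data);
    s->data = NULL;
    s->length = 0;
}
```

Because the length is stored explicitly, the seven bytes of "abc\0def" can be kept in full, whereas `strlen` on the same data stops at the embedded zero and reports 3.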

Languages like C++, PHP, Perl, etc. have their own internal string libraries, which often have a separate length field that speeds up some string functions and removes the need for the \0.

Some other languages (like Pascal) use a string type that is called (surprisingly) a Pascal string. It stores the length in the first byte of the string, which is why those strings are limited to a length of 255 characters.
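The length-in-the-first-byte layout can be imitated in C. This is only a sketch of the idea, not how any particular Pascal compiler lays out its strings, and the function names `pstring_from_c` and `pstring_length` are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: builds a length-prefixed string in dst.
   dst must have room for 256 bytes (1 length byte + up to 255 chars).
   Returns 0 on success, -1 if src is too long for a one-byte length. */
int pstring_from_c(unsigned char *dst, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    n = strlen(src);
    if (n > 255) {
        return -1;              /* a single byte cannot hold the length */
    }
    dst[0] = (unsigned char)n;  /* length goes in the first byte */
    memcpy(dst + 1, src, n);    /* characters follow, no '\0' needed */
    return 0;
}

/* Reading the length is a single byte load, no scanning required. */
size_t pstring_length(const unsigned char *p)
{
    assert(p != NULL);
    return p[0];
}
```

The 255-character limit falls straight out of the layout: one byte can only count that high.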

4

Because in C, strings are just a sequence of characters accessed via a pointer to the first character.

There is no space in a pointer to store the length so you need some indication of where the end of the string is.

In C it was decided that this would be indicated by a null character.

In Pascal, for example, the length of a string is recorded in the byte immediately preceding the first character, which is why Pascal strings have a maximum length of 255 characters.

pauljwilliams
1

It is a convention - one could have implemented it with another algorithm (e.g. length at the beginning of the buffer).

In a "low-level" language such as assembler, it is easy to test for a zero byte efficiently; that might have eased the decision to go with null-terminated strings as opposed to keeping track of a length counter.

jldupont
1

They need to be null terminated so you know how long they are. And yes, they are simply arrays of char.

Higher level languages like PHP may choose to hide the null termination from you or not use it at all - they may maintain a length, for example. C doesn't do it that way because of the overhead involved. High level languages may also not implement strings as an array of char - they could (and some do) implement them as lists of arrays of char, for example.

1

In C, strings are represented by an array of characters allocated in a contiguous block of memory, and thus there must either be an indicator stating the end of the block (i.e. the null character) or a way of storing the length (like Pascal strings, which are prefixed by a length).

In languages like PHP, Perl, C#, etc., strings may or may not have complex data structures, so you cannot assume they have a null character. As a contrived example, you could have a language that represents a string like so:

class string
{
   int length;
   char[] data;
}

but you only see it as a regular string with no length field, as this can be calculated by the runtime environment of the language and is only used internally by it to allocate and access memory correctly.

Vishal Mistry
0

They are null-terminated because plenty of standard library functions expect them to be.

Alexander Poluektov
  • And also because that is how the C language spec says that string literals are encoded. – Stephen C Feb 08 '10 at 11:59
  • @Stephen C, you are the only one who said it! Very important reason! Silly C strings... I would like a C, libc and string literals with "pascal strings". – Prof. Falken Nov 04 '10 at 00:15