21

I wrote this function in C, which is meant to iterate through a string to the next non-white-space character:

char * iterate_through_whitespace(unsigned char * i){
    while(*i && *(i++) <= 32);
    return i-1;
}

It seems to work quite well, but I'm wondering if it is safe to assume that the *i will be evaluated to false in the situation that *i == '\0', and it won't iterate beyond the end of a string. It works well on my computer, but I'm wondering if it will behave the same when compiled on other machines.

Martin Smith
Paul
  • Good question. More people should ask themselves before assuming. Trivia: what happens when you null-terminate a UTF8 string? After a double/triple/quadbyte leader? In UCS-16? Is the terminator two bytes then, or are zero terminators deprecated for UNICODE? – sehe Sep 25 '11 at 23:00
  • 2
    @sehe: Null terminators work normally for UTF-8 strings. For UCS-2 or UTF-16 (not UCS-16), null terminators are 16 bits. – Keith Thompson Sep 25 '11 at 23:11
  • @Keith: Your point is true but incomplete. A UTF-8 string that has a null terminator after a partial character is malformed and will result in `EILSEQ` when the null byte is encountered when converting it with standard library functions. – R.. GitHub STOP HELPING ICE Sep 26 '11 at 00:26
  • @R..: Then I'd argue that it's not a UTF-8 string. And it won't cause an `EILSEQ` error for non-converting functions like `strcpy()`. Good point, though. (And the original poster can probably ignore these details, at least for now.) – Keith Thompson Sep 26 '11 at 01:08

5 Answers

16

The standard says:

A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.
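As a minimal sketch of what that guarantee buys you: because the terminator is a byte with value 0, a bare `s[n]` in a loop condition is false exactly at the end of the string. The helper name `count_until_null` is hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the characters that precede the
   terminating null character, much like strlen() does. */
size_t count_until_null(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    /* s[n] compares unequal to 0 for every character except the
       null character, so the loop stops at the terminator. */
    while (s[n]) {
        n++;
    }
    return n;
}
```

Calling `count_until_null("hello")` should give 5, and a string with an embedded null such as `"ab\0cd"` is cut off at the first null byte.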

cnicutar
  • 1
    And a byte with all bits set to 0 has the value 0. That may seem obvious, and it's true, but you'd have to do a bit of searching in the standard to demonstrate it. – Keith Thompson Sep 25 '11 at 23:00
  • @Keith Thompson I was searching for a stronger assertion but I can't seem to find anything really relevant. – cnicutar Sep 25 '11 at 23:02
  • I think I saw a corrigendum that made sure memsetting an int with 0 would yield a 0 value, but I can't find it. – Artefacto Sep 25 '11 at 23:04
  • 3
    You have to dig into [C99](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf) 6.2.6.2, which covers the representation of integer types. It requires a binary representation (which it defines) and says that the character types have no padding bits. – Keith Thompson Sep 25 '11 at 23:08
  • 1
    @Artefacto: N1256 6.2.6.2p5 -- but the required lack of padding bits for the character types makes it unnecessary for those types. – Keith Thompson Sep 25 '11 at 23:09
  • @Keith Thompson Strangely "For any integer type, the object representation where all the bits are zero shall be a representation of the value zero in that type." is missing from my copy. Thanks for the heads-up. – cnicutar Sep 25 '11 at 23:20
  • @cnicutar: That was added in the second (I think) Technical Corrigendum; it's not in the original ISO C99 standard. – Keith Thompson Sep 26 '11 at 01:04
14

Yes -- but in my opinion it's better style to be more explicit:

while (*i != '\0' && ...

But the comparison to 32 is hardly the best approach. 32 happens to be the ASCII/Unicode code for the space character, but C doesn't guarantee any particular character set -- and there are plenty of control characters with values less than 32 that aren't whitespace.

Use the isspace() function.

(And I'd never name a pointer i.)
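A rough sketch of how those suggestions could fit together, assuming the goal is still "return a pointer to the next non-whitespace character". The name `skip_whitespace` and the `const char *` signature are choices made for this illustration, not something from the question.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical replacement for iterate_through_whitespace(): returns a
   pointer to the first non-whitespace character in s, or to the
   terminating null character if only whitespace remains. */
const char *skip_whitespace(const char *s)
{
    assert(s != NULL);

    /* The explicit '\0' test stops the loop at the terminator.
       isspace() requires a value representable as unsigned char
       (or EOF), hence the cast before the call. */
    while (*s != '\0' && isspace((unsigned char)*s)) {
        s++;
    }
    return s;
}
```

Advancing the pointer only inside the loop body means there is no `- 1` adjustment on return, so an empty or all-whitespace string yields a pointer to its terminator rather than to the byte before it.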

Keith Thompson
3

In C, '\0' has the exact same value and type as 0. There is no reason to ever write '\0' except to uglify your code. \0 might however be useful inside double quotes to make strings with embedded null bytes.
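A small sketch of both points, for anyone who wants to check them on their own compiler. The function names are hypothetical, and the `sizeof` comparison holds for C only (in C++ a character constant has type `char`).

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if '\0' has the value 0 and the size of an int,
   which is what the C rules for character constants imply. */
int null_char_is_int_zero(void)
{
    return '\0' == 0 && sizeof '\0' == sizeof(int);
}

/* Hypothetical helper: builds a string literal with an embedded null
   byte and reports how much of it strlen() can see. */
size_t visible_length(void)
{
    /* Six bytes in total: 'a', 'b', '\0', 'c', 'd', and the
       implicit terminator. */
    static const char buf[] = "ab\0cd";

    assert(sizeof buf == 6);

    /* strlen() stops at the first null byte. */
    return strlen(buf);
}
```

One caveat with `\0` inside double quotes: if the next character is an octal digit it becomes part of the escape, so `"\012"` is a single newline character, not a null followed by "12".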

R.. GitHub STOP HELPING ICE
  • 1
    I disagree. It does have the same type and value as `0`, but I prefer to use `'\0'` when it's going to be used as a character -- just as I like to use `NULL` rather than `0` in pointer context. – Keith Thompson Sep 26 '11 at 03:52
  • @Keith, while I agree about `'\0'`, for `NULL` the technicalities of null pointers are so messy that I prefer to use `0`, simply because it is unambiguous. – Jens Gustedt Sep 26 '11 at 09:03
  • @JensGustedt: How is `NULL` ambiguous? You need to cast it when passing it as a variadic argument, but the same applies to `0`. – Keith Thompson Sep 26 '11 at 15:52
  • Using `NULL` can hide bugs where it matters that you have a pointer type and not an integer type, since `NULL` can be of integer type. Using `0` will make the code break right away (or at least throw the relevant warnings) so you can fix it. I agree with Jens and also avoid ever using `NULL`. – R.. GitHub STOP HELPING ICE Sep 26 '11 at 16:05
  • @Keith, unfortunately `NULL` without a cast will work on many platforms, whereas `0` without a cast will crash reliably. – Jens Gustedt Sep 26 '11 at 16:08
  • @JensGustedt: If `NULL` is defined as `((void*)0)`, yes -- but using either `NULL` or `0` without a cast in that context has undefined behavior anyway. The solution is to add the cast. – Keith Thompson Sep 26 '11 at 16:23
  • @Keith: The point is whether or not the compiler can warn you that your code is non-portable. If you use `NULL`, the compiler would need special knowledge that `NULL` was defined by the standard headers and not redefined by your application in order to issue a warning that the usage is non-portable. In reality, no compiler does this. If you use 0, you'll get much better warnings when you misuse it. – R.. GitHub STOP HELPING ICE Sep 26 '11 at 16:29
  • @Keith, if `NULL` happens to be of type `void*` and the function expects such a thing, the behavior isn't undefined, and so it may go unnoticed for a long time. And if we have to add the cast, we can equally write `(void*)0`. This is as clear as `NULL` to mark the intention that this is a pointer value. So `NULL` serves no purpose at all. http://gustedt.wordpress.com/2010/11/07/dont-use-null/ – Jens Gustedt Sep 26 '11 at 18:11
  • @JensGustedt: It makes a difference only in the context of an argument to a variadic function. Passing a null pointer constant to any of the standard-defined variadic functions is rare; the POSIX `exec*()` functions are probably the most common case. (Functions with old-style declarations are another case, but that's increasingly irrelevant.) In all other cases, either `NULL` or `0` will be implicitly converted to the appropriate pointer type. In those contexts, `NULL` vs. `0` makes no difference to the compiler, but it's clearer *to the reader* that it's a pointer expression. – Keith Thompson Sep 26 '11 at 18:20
0

I find the other answers inadequate because they do not provide a direct answer to the question in the title.

Is '\0' guaranteed to be 0?

No, the integer value of the construction '\0' is not guaranteed to be 0 by the C standard.

Regarding the null character, all we know is that (C99 p.17, C11 p.22)

[a] byte with all bits set to 0, called the null character, shall exist in the basic execution character set.

and that (C99 p. 61, C11 p.69)

[t]he construction '\0' is commonly used to represent the null character.

Emphasis on "commonly used". There is no guarantee.

OTheDev
0

The ASCII standard dictates that the NUL character is encoded as the byte 0. Unless you stop working with encodings that are backwards compatible with ASCII, nothing should go wrong.

zneak
  • The C standard dictates this too. – Marcelo Cantos Sep 25 '11 at 23:00
  • This question has nothing to do with ASCII, which the C standard doesn't even require to be used. – Artefacto Sep 25 '11 at 23:01
  • @Artefacto, hence the "unless you stop working with encodings that are backwards compatible with ASCII". I did not try to link C with ASCII. – zneak Sep 25 '11 at 23:01
  • @Artefacto: Well, the question does make the ASCII-specific assumption that whitespace characters have values <= 32 -- but it shouldn't. – Keith Thompson Sep 25 '11 at 23:02
  • @zneak: It doesn't matter whether you're using ASCII-compatible encodings or not. *All* encodings used by any C implementation, ASCII or not, must represent the null character as 0. (As it happens, both ASCII and EBCDIC do so; if that weren't the case, the C standard probably wouldn't have required it.) – Keith Thompson Sep 25 '11 at 23:03
  • @Keith, I don't have a copy of the C standard to verify, but I'll trust you. Either way, aren't we saying the same thing from a different point of view? – zneak Sep 25 '11 at 23:05
  • @zneak: Not exactly. Even if you do "stop working with encodings that are backwards compatible with ASCII" (for example, if you switch to EBCDIC), the guarantee still applies. See one of my comments for a link to a PDF of the latest post-C99 draft. – Keith Thompson Sep 25 '11 at 23:12