
I'm using isspace in order to iterate a string and identify whitespace characters:

const char* s = "abcd efg";
const char* ptr = s;

for (; *ptr != '\0'; ptr++)
    printf("%c: %s\n", *ptr, isspace(*ptr) ? "yes" : "no");

As you know, isspace takes an int, not a char. The above seems to work - but I would like to validate whether or not it is portable, or "works by accident".

What would be the idiomatic way to convert a char to an int, for use with isspace?

Aviv Cohn

1 Answer


You should be using unsigned char to manage characters. For example, fgetc returns a “character as an unsigned char converted to an int” (C 2018 7.21.7.1). Using char can result in negative values and undefined behavior, as explained below.

7.4 1 defines the behavior of the <ctype.h> functions only for arguments whose value is representable as an unsigned char or EOF:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

Thus, if you have a char with a negative value, and you pass it to one of the <ctype.h> functions, that value is not representable as an unsigned char. And it is generally not EOF either. The negative char value will be implicitly converted to an int by the function call, but the value will remain negative. So the behavior would not be defined by the C standard.

Per 6.2.5 3, all members of the basic execution character set have non-negative values:

If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.

Per 5.2.1 3, the basic execution character set includes at least the Latin alphabet in uppercase and lowercase, the ten digits, space, horizontal tab, vertical tab, form feed, alert, backspace, carriage return, new line, and these characters:

! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~

So, if your string has any other character, it could have a negative value. Then, the behavior of the <ctype.h> functions is not defined by the C standard.
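To make the practical consequence concrete, here is a minimal sketch of the usual remedy: convert each char to unsigned char before passing it to a <ctype.h> function. The helper names is_space_char and count_spaces are hypothetical, chosen only for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: classify a single char without leaving the
   domain that <ctype.h> defines (unsigned char values or EOF). */
int is_space_char(char c)
{
    /* The conversion to unsigned char yields a value in 0..UCHAR_MAX,
       even if c is negative on a platform where char is signed. */
    return isspace((unsigned char)c) != 0;
}

/* Hypothetical helper: count whitespace characters in a
   NUL-terminated string. */
size_t count_spaces(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (is_space_char(*s))
            n++;
    }
    return n;
}
```

Applied to the loop in the question, the same idea amounts to writing isspace((unsigned char)*ptr) instead of isspace(*ptr).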

Eric Postpischil
  • doubt no more. MSVCRT debug versions *assert* rather loudly on negative values. – Antti Haapala -- Слава Україні Apr 22 '19 at 19:40
  • I hammered the question. Also, EOF usually has value -1 and *it too* is accepted as an argument, and always returns "false" for any class, so it would be indistinguishable from `ŷ` say on Latin-1 with Turkish locale or so... – Antti Haapala -- Слава Україні Apr 22 '19 at 19:44
  • Hi Eric, thank you for the answer. Was wondering, do you know the reason why `isspace` and family take an `int`? Why not an `unsigned char` then? – Aviv Cohn Apr 22 '19 at 21:08
  • @AvivCohn: For convenience, they accept `EOF`, which is not a character. This is convenient for code that might want to branch on some expected classifications before it tests for the rarer `EOF` case, or the author just thinks their code is nicer that way. So the functions have to accept all `unsigned char` values plus one more. So an `int` is used. – Eric Postpischil Apr 22 '19 at 21:30
  • @AvivCohn: I use EOF to detect the end of a file of binary data. Binary data can include the value 0x00 like any other value, so some other value (outside of 0x00-0xff) is needed to detect the end of a file. When used this way EOF does not represent a character - it is a way for the system to tell you that no more characters are available. – Kevin Olree Apr 23 '19 at 01:37