8

According to C11 WG14 draft version N1570:

The header <ctype.h> declares several functions useful for classifying and mapping characters. In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

Is the following undefined behaviour?

#include <ctype.h>
#include <limits.h>
#include <stdlib.h>

int main(void) {
  char c = CHAR_MIN; /* assume that char is signed, so CHAR_MIN < 0 */
  return isspace(c) ? EXIT_FAILURE : EXIT_SUCCESS;
}

Does the standard allow passing a char to isspace() (char converted to int)? In other words, is a char value, after conversion to int, representable as an unsigned char?


Here's how wiktionary defines "representable":

Capable of being represented.

Is char capable of being represented as unsigned char? Yes. §6.2.6.1/4:

Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes. The value may be copied into an object of type unsigned char [n] (e.g., by memcpy); the resulting set of bytes is called the object representation of the value.

sizeof(char) == 1, therefore its object representation is unsigned char[1], i.e., char is capable of being represented as an unsigned char. Where am I wrong?

Concrete example: I can represent [-2, -1, 0, 1] as [0, 1, 2, 3]. If I can't, then why not?


Related: According to §6.3.1.3, isspace((unsigned char)c) is portable if INT_MAX >= UCHAR_MAX; otherwise it is implementation-defined.
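
A minimal sketch of that call pattern, assuming INT_MAX >= UCHAR_MAX (which holds on mainstream implementations). The helper name `is_space_char` is hypothetical and not part of the standard library:

```c
#include <ctype.h>

/* Hypothetical helper: classify a plain char without risking undefined
 * behaviour. The cast maps a negative char into 0..UCHAR_MAX (6.3.1.3/2)
 * before the value is promoted to int for isspace(). */
static int is_space_char(char c)
{
    return isspace((unsigned char)c) != 0;
}
```

With this wrapper, `is_space_char(CHAR_MIN)` stays inside the documented domain of isspace() whether plain char is signed or unsigned.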

jww
jfs
  • 1
    I'd say it is unspecified whether or not it is undefined behaviour -- `char` can be unsigned, so `CHAR_MIN` can be `0`. For a signed char, `-1` is a valid value, but it cannot be represented as an `unsigned char` (it is not in the range of representable values for this type). – dyp Sep 10 '14 at 23:48
  • @dyp: is it unspecified or implementation-defined? Let's assume that `char` is `signed` (it is common). I'll update the question – jfs Sep 10 '14 at 23:51
  • 2
    @dyp: `signed`-ness of plain `char` must be documented, thus it is only implementation-defined whether it is undefined-behavior or well-defined. – Deduplicator Sep 10 '14 at 23:54
  • @Deduplicator You're right. It is either plain UB or implementation-defined whether or not it's UB. – dyp Sep 10 '14 at 23:57
  • @dyp: to answer my question in the comment: the draft says in 6.2.5/15 *"The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char. 45)"* i.e., it is not merely unspecified, it is implementation-defined (the implementation documents the choice). – jfs Sep 11 '14 at 00:05
  • The rationale for this specification is so that `isspace` etc. can be implemented via an array, e.g. on a typical system, `char const spaces[256] = { 0,0,0,0,0,0,0,0,0,1,0,0,1,0,` ... `#define isspace(x) ((int)spaces[x])`. Although in practice I suspect common compilers will support negative arguments just because it is such a common blunder to pass a negative argument. – M.M Sep 11 '14 at 02:25
  • @MattMcNabb: `spaces[(unsigned char)x]` – jfs Sep 11 '14 at 02:39
  • @dyp it seems like it is simpler to say it is undefined if *char* is *signed* and well defined otherwise. – Shafik Yaghmour Sep 11 '14 at 06:27

3 Answers

10

What does representable in a type mean?

Re-formulated, a type is a convention for what the underlying bit-patterns mean. A value is thus representable in a type, if that type assigns some bit-pattern that meaning.

A conversion (which might need a cast), is a mapping from a value (represented with a specific type) to a value (possibly different) represented in the target type.


Under the given assumption (that char is signed), CHAR_MIN is certainly negative, and the text you quoted leaves no room for interpretation:
Yes, it is undefined behavior, as unsigned char cannot represent any negative numbers.

If that assumption did not hold, your program would be well-defined, because CHAR_MIN would be 0, a valid value for unsigned char.

Thus, we have a case where it is implementation-defined whether the program is undefined or well-defined.


As an aside, there is no guarantee that sizeof(int) > 1 or that INT_MAX >= UCHAR_MAX, so int might not be able to represent every value of unsigned char.

A conversion preserves the value whenever the target type can represent it, and int can represent every value of signed char, so a signed char can always be converted to int unchanged.
But if the value was negative, that does not change the impossibility of representing it as an unsigned char. (The conversion itself is still defined: converting any integer value to an unsigned integer type is always defined by §6.3.1.3/2, but it changes a negative value.)
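
As a rough illustration of "representable" in this value sense, the check below tests whether an int holds a value that unsigned char can represent, and whether it is a valid `<ctype.h>` argument. Both helper names are hypothetical:

```c
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical helper: true if v is a value that unsigned char can represent. */
static int representable_as_uchar(int v)
{
    return v >= 0 && v <= UCHAR_MAX;
}

/* Hypothetical helper: true if v lies in the domain of the <ctype.h> functions. */
static int valid_ctype_argument(int v)
{
    return representable_as_uchar(v) || v == EOF;
}
```

With a signed char, CHAR_MIN converts to a negative int, so `representable_as_uchar(CHAR_MIN)` is false, and unless that value happens to equal EOF the call isspace(CHAR_MIN) is outside the permitted domain.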

Deduplicator
  • `sizeof(int)==1` is certainly interesting; a DS9K implementation could have an unsigned character `char` type whose values are greater than what `int` can represent, it seems. – dyp Sep 11 '14 at 00:03
  • Another admirer of DS9K, I see. They should finally deliver mine. – Deduplicator Sep 11 '14 at 00:08
  • Could you elaborate the part *"the text you quoted leaves no room for interpretation"*? The question is what the word "representable" mean. I *can* represent all `char` values as `unsigned char` (the formula might depend on an implementation but it **exists** for all implementations). The standard may use different meaning. Could you describe that meaning? Imagine it is english.SE but for programmers. – jfs Sep 11 '14 at 00:15
  • @J.F.Sebastian: Changed much. Better now? – Deduplicator Sep 11 '14 at 00:36
  • @Deduplicator: according to your definition. `char` is representable as `unsigned char`. I've updated the question – jfs Sep 11 '14 at 00:59
  • @J.F.Sebastian: `char` is representable as `unsigned char` does not make any sense. Only values can (or cannot) be representable. According to my definition, all values representable with a `char` can be represent in either `unsigned char` or `signed char`, which is implementation-defined. – Deduplicator Sep 11 '14 at 01:34
  • @Deduplicator: `char` is a jargon. It can mean both the type and a value of the type in my question. The quote from the standard (§6.2.6.1/4) says exactly what I mean – jfs Sep 11 '14 at 01:39
  • `char` is most certainly always a type, never a value. The "object representation" is how the type maps a value to bytes (and thus bits). The interesting thing which seems to throw you is that byte and (`unsigned`) `char` are identical in C. What you showed is not "Is `char` capable of being represented as `unsigned char`?" but "Can a single `unsigned char` hold the object representation of a value of type `char`?". – Deduplicator Sep 11 '14 at 01:50
  • @dyp any `sizeof(int)==1` implementation would have `UCHAR_MAX > INT_MAX`, not just the DS9K – M.M Sep 11 '14 at 02:23
  • 1
    @Deduplicator: common. Have you ever said `char` is less than `CHAR_MAX + 1` meaning that any value of type `char` is less than `CHAR_MAX + 1` (mathematically)? It is common to refer to a set and a member of the set by the same name. And I've clarified using the direct reference §6.2.6.1/4 the meaning to avoid any ambiguity. – jfs Sep 11 '14 at 02:52
  • Nope, not without that critical "every". And while it is a common idiom to name one representative part of a whole if you mean the whole, referring to a set and a member of the set by the same name is rather uncommon. Anyway, when speaking standards, sloppy wording should be avoided. – Deduplicator Sep 11 '14 at 03:54
4

Assuming that char is signed, this is undefined behavior; otherwise it is well defined, since CHAR_MIN would have the value 0. It is easier to see the intention and meaning of:

the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF

if we read section 7.4 Character handling <ctype.h> from the Rationale for International Standard—Programming Languages—C which says (emphasis mine going forward):

Since these functions are often used primarily as macros, their domain is restricted to the small positive integers representable in an unsigned char, plus the value of EOF. EOF is traditionally -1, but may be any negative integer, and hence distinguishable from any valid character code. These macros may thus be efficiently implemented by using the argument as an index into a small array of attributes.

So valid values are:

  1. Non-negative integers that fit into unsigned char
  2. EOF, which is some implementation-defined negative number

Although this is the C99 rationale, the particular wording you are referring to does not change from C99 to C11, so the rationale still fits.
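
The array-index implementation the Rationale describes might look roughly like the sketch below. The names `space_table` and `MY_ISSPACE` are hypothetical, the table covers only the "C" locale white-space characters, and the layout assumes EOF == -1:

```c
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical attribute table: slot 0 is reserved for EOF (assumed to be -1),
 * slots 1..UCHAR_MAX+1 hold the attributes of character codes 0..UCHAR_MAX. */
static const unsigned char space_table[UCHAR_MAX + 2] = {
    [' ' + 1]  = 1, ['\t' + 1] = 1, ['\n' + 1] = 1,
    ['\v' + 1] = 1, ['\f' + 1] = 1, ['\r' + 1] = 1,
};

/* Only valid for c in 0..UCHAR_MAX or c == EOF. Any other negative c
 * indexes before the start of the table, which is undefined behaviour. */
#define MY_ISSPACE(c) ((int)space_table[(c) + 1])
```

No range check is performed, which is what makes the lookup cheap and also why a negative char value other than EOF cannot be supported by this design.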

We can also find why the interface takes an int argument rather than a char in section 7.1.4 Use of library functions, which says:

All library prototypes are specified in terms of the “widened” types: an argument formerly declared as char is now written as int. This ensures that most library functions can be called with or without a prototype in scope, thus maintaining backwards compatibility with pre-C89 code. Note, however, that since functions like printf and scanf use variable-length argument lists, they must be called in the scope of a prototype.

Shafik Yaghmour
  • @J.F.Sebastian there is nothing wrong with that, the cast will convert the signed char (*assuming it is signed, otherwise it will do nothing*) into the range of unsigned char via the rules in `6.3.1.3 p 2`. – Shafik Yaghmour Sep 11 '14 at 07:15
  • I know, §6.3.1.3 is referenced in both my question and [answer](http://stackoverflow.com/a/25778549/4279). I meant why `(unsigned char)` is not used inside `isspace()` itself to avoid undefined behavior. – jfs Sep 11 '14 at 07:30
  • 2
    @J.F.Sebastian seems likely to avoid overlap with `EOF` which would be bad. It also makes for a clear interface. – Shafik Yaghmour Sep 11 '14 at 07:33
  • UB for a `char` value passed as `int` is *not* a clear interface. It is the exact opposite. It may work (it works with both gcc, clang on my system), it may crash your program or send a letter to Mars. All are valid according to "interface". I would understand the argument about uncompromising time performance e.g., `(c != EOF) && (_uctype[(unsigned char)c] & _SPACE)` might be slower than `(_ctype + 1)[c] & _S` if the safer choice were available. – jfs Sep 11 '14 at 18:06
  • @J.F.Sebastian when I said *clear* I was working with the restriction that we are given a choice between a more restricted domain and allowing negative values to be cast back into the domain the prior provides a clearer interface. – Shafik Yaghmour Sep 12 '14 at 13:08
  • 2
    I might misunderstand the word "clear" but it is a terrible API design that passing a character to a character classification function may lead to undefined behavior. – jfs Oct 26 '14 at 05:11
  • @jfs: There are many places where the Standard should have defined a "preferred" behavior and an "acceptable" behavior, along with a macro to indicate which an implementation supports. Support for all values of `char` should be a "preferred" behavior, but limiting support to members of the Execution Character Set should be "acceptable". A Strictly Conforming program should be able to use `#ifdef` to determine whether an implementation behaves in "preferred" fashion, and refuse to run on any that don't. – supercat Oct 10 '18 at 21:02
1

The revealing quote (for me) is §6.3.1.3/1:

if the value can be represented by the new type, it is unchanged.

i.e., if the value has to be changed then the value can't be represented by the new type.

Therefore an unsigned type can't represent a negative value.

To answer the question in the title: "representable" refers to "can be represented" from §6.3.1.3 and is unrelated to "object representation" from §6.2.6.1.
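
The two notions can be put side by side in a small sketch. Both function names are hypothetical:

```c
#include <string.h>

/* Copies the object representation (the stored bits) of c, as described
 * in 6.2.6.1/4. No value conversion takes place. */
static unsigned char object_byte(signed char c)
{
    unsigned char byte;
    memcpy(&byte, &c, 1);
    return byte;
}

/* Converts the value of c to unsigned char per 6.3.1.3. A negative value
 * is not representable, so it is changed by adding UCHAR_MAX + 1. */
static unsigned char converted_value(signed char c)
{
    return (unsigned char)c;
}
```

On a two's complement machine both functions return the same byte for -1, but only the second is a value conversion, and the fact that it turns -1 into UCHAR_MAX is what shows that -1 is not representable as an unsigned char.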

It seems trivial in retrospect. I might have been confused by the habit of treating b'\xFF', 0xff, 255, -1 as the same byte in Python:

>>> (255).to_bytes(1, 'big')
b'\xff'
>>> int.from_bytes(b'\xFF', 'big')
255
>>> 255 == 0xff
True
>>> (-1).to_bytes(1, 'big', signed=True)
b'\xff'

and the disbelief that it is undefined behavior to pass a character to a character classification function, e.g., isspace(CHAR_MIN).

jfs