
Following the question titled "Warning generated due wrong strcmp parameter handling", there seem to be some questions about what the Standard actually guarantees regarding the value representation of character types.


THE QUESTION

strcmp (buf1, reinterpret_cast<char const *> (buf2));

This looks fine, but does the Standard guarantee that (1) will always yield true?

char unsigned * p1 = ...;
char          * p2 = reinterpret_cast<char *> (p1);

*p1 == *p2; // (1)
Filip Roséen - refp
  • `char const unsigned` sure is an unusual way to name the type. – MSalters Jun 04 '14 at 09:23
  • why would you ever want to use `unsigned char` (to hold a character rather than a small integer) instead of `char`? – Walter Jun 04 '14 at 09:37
  • @Walter some find the representation of bytes as a range [0,255] more intuitive than the range [-128,127], myself included. Additionally, it can be good to explicitly show something is a byte (unsigned char), not a character (char). – iFreilicht Jun 04 '14 at 12:21
  • @iFreilicht Granted. But I think he wants to interpret this as a character rather than a byte. – Walter Jun 04 '14 at 15:13
  • @walter when displaying as char and bytes (%02X) it is useful for avoiding sign extension issues. – EvilTeach Jun 04 '14 at 16:30

2 Answers


THIS MIGHT SURPRISE YOU,

but there's no such guarantee in the C++11 Standard (N3337), nor in the upcoming C++14 (N3797).

char unsigned * p1 = ...;
char          * p2 = reinterpret_cast<char *> (p1);

*p1 == *p2; // (1), not guaranteed to be true

Note: it is implementation-defined whether char is signed or unsigned; see [basic.fundamental]p1.



DETAILS

The Standard guarantees that all character types (char, signed char, and unsigned char):

  • have the same alignment requirement;
  • occupy the same amount of storage;
  • use every bit of that storage in the value representation; and
  • have the same object representation.

Because the character types share the same amount of storage, the same alignment requirement, and the guarantee about bit participation, casting an lvalue referring to one type (unsigned char) to another (char) is safe, as far as the cast itself is concerned.

3.9.1p1 Fundamental types [basic.fundamental]

It is implementation-defined whether a char can hold negative values. Characters can be explicitly declared signed or unsigned.

A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For character types, all bits of the object representation participate in the value representation.

For unsigned character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types.

3.9p4 Types [basic.types]

The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T). The value representation of an object is the set of bits that hold the value of type T.
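To make the quoted wording more concrete, here is a minimal sketch that copies the object representation of an unsigned char into a char and checks that the stored byte is unchanged. The helper name same_object_representation is made up for this example; only std::memcpy and std::memcmp come from the standard library.

```cpp
#include <climits>
#include <cstring>

// Hypothetical helper: copies the single byte that makes up `u` into a plain
// char and reports whether both objects now hold the same byte.
bool same_object_representation(unsigned char u) {
    char c;
    std::memcpy(&c, &u, 1);              // copy the object representation as-is
    return std::memcmp(&c, &u, 1) == 0;  // memcmp compares the bytes as unsigned char
}
```

The stored byte is identical either way; what can differ is the value obtained when that byte is read as a char instead of an unsigned char.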



SO, WHAT IS THE PROBLEM?

If we assign the maximum value of an unsigned char (UCHAR_MAX) to *p1 and char is signed, *p2 won't be able to represent this value. Reading the same bits through *p2 will, most likely, yield the value -1.

Note: no arithmetic overflow takes place here; the same bits are simply interpreted as a different type (actual signed integer overflow would be undefined behavior).


*p1 = UCHAR_MAX;

*p1 == *p2; // (1)

Both sides of operator== must have the same type before we can compare them, and currently one side is unsigned char and the other char.

The compiler will therefore apply the integral promotions, which convert both operands to a type that can represent all values of the two types; in this case, the resulting type is int.

After integral promotion, the statement is semantically equivalent to int(UCHAR_MAX) == int(-1), which is of course false.
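The promotion argument can be checked with a small sketch. The helper compares_equal_through_char and the constant char_is_signed are names invented for this illustration; the pointer setup mirrors the snippet from the question, with a concrete object standing in for the elided initializer.

```cpp
#include <climits>
#include <limits>

// True on implementations where plain char can hold negative values.
constexpr bool char_is_signed = std::numeric_limits<char>::is_signed;

// Hypothetical helper mirroring expression (1): stores `value` in an
// unsigned char and compares it with the same storage read as plain char.
bool compares_equal_through_char(unsigned char value) {
    unsigned char  storage = value;
    unsigned char* p1 = &storage;
    char*          p2 = reinterpret_cast<char*>(p1);
    // Both operands are promoted to int before the comparison takes place.
    return *p1 == *p2;
}
```

Values that fit in both types, such as 0x41, compare equal regardless of the signedness of char. For UCHAR_MAX the result depends on char_is_signed, which is exactly why (1) is not guaranteed.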

Filip Roséen - refp
  • Hmmmm... I had assumed that char was signed. – nishantjr Jun 04 '14 at 08:27
  • @FilipRoséen-refp `char` cannot have padding bits (will look where the standard says so). I'm not sure if `signed char` can, but if it can, then an implementation that gives `signed char` padding bits must make `char` unsigned. Regardless, `strcmp` doesn't actually use the `char` type of its parameters, it converts them back to `const unsigned char *` before comparing anything. –  Jun 04 '14 at 09:03
  • @FilipRoséen-refp 3.9.1p1: "For character types, all bits of the object representation participate in the value representation." –  Jun 04 '14 at 09:05
  • @hvd note that in the latest draft, only [unsigned narrow chars](http://stackoverflow.com/questions/23415661/has-c-standard-changed-with-respect-to-the-use-of-indeterminate-values-and-und) is guaranteed to not have undefined behavior when used containing an indeterminate value. Which indicates to me that only unsigned char is guaranteed to not have a trap representation. – Shafik Yaghmour Jun 04 '14 at 09:30
  • @ecatmur as stated, `3.9p2` is about the possibility of storing bytes in a plain char array, not about reading them as if they were a plain char. – Filip Roséen - refp Jun 04 '14 at 09:36
  • @FilipRoséen-refp What about reading §3.10/10? It clearly states that it is allowed to access the stored value through a glvalue of type (plain) `char`. – Columbo Jun 04 '14 at 09:37
  • @Arcoth that doesn't prove that there can be no trap-values in `char`. – Filip Roséen - refp Jun 04 '14 at 09:39
  • Minor nitpick, in the question there is no assignment, and no overflow. The values are not equal solely due to integer promotion rules and the fact that char can have negative values. – Remember Monica Jun 04 '14 at 15:16
  • @MarcLehmann Why does there need to be assignment to be overflow? The simple dereference and read of the `char*` results in an implementation-defined value, and it may be a trap value if `char` is signed (and hence cause the program to halt), or worse. – Yakk - Adam Nevraumont Jun 04 '14 at 17:22
  • @MarcLehmann it's implementation defined whether a char can hold negative values, and the assignment is just to make the example easier to understand. – Filip Roséen - refp Jun 04 '14 at 17:28
  • A trap value cannot (by definition) cause an overflow, it can trap. If it's implementation defined whether a char can have negative values, then char can have negative values, no? – Remember Monica Jul 23 '14 at 21:11
14
strcmp (buf1, reinterpret_cast<char const *> (buf2));

This looks fine,

It is. strcmp takes const char * parameters, but internally converts them to const unsigned char * (if required), so that even if char is signed and two distinct bytes could compare equal when viewed as char, they will still compare unequal when viewed through strcmp.

C99:

7.21 String handling <string.h>

7.21.1 String function conventions

3 For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value).
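As a rough illustration of that rule, a comparison function written along these lines reads every character through an unsigned char pointer. This is only a sketch: compare_as_unsigned is a made-up name, and real library implementations of strcmp are typically far more optimized.

```cpp
#include <cstring>

// Hypothetical stand-in for strcmp, written to show the "as if unsigned char"
// rule from the quoted paragraph.
int compare_as_unsigned(const char* lhs, const char* rhs) {
    const unsigned char* a = reinterpret_cast<const unsigned char*>(lhs);
    const unsigned char* b = reinterpret_cast<const unsigned char*>(rhs);
    // Advance while the characters match and the terminator is not reached.
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    // Subtract as int so the difference cannot wrap around.
    return static_cast<int>(*a) - static_cast<int>(*b);
}
```

With this interpretation, a byte such as 0x80 orders after 0x7f even where char is signed, which is the same ordering std::strcmp is required to produce.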

That said,

but does the Standard guarantee that (1) will always yield true?

char unsigned * p1 = ...;
char          * p2 = reinterpret_cast<char *> (p1);

*p1 == *p2; // (1)

What you wrote is not guaranteed.

Take a common implementation where char is signed and bytes are 8 bits wide, using two's complement representation. If *p1 is UCHAR_MAX, then *p2 == -1, and *p1 == *p2 will be false because the promotion to int gives them different values.

If you meant either (char) *p1 == *p2, or *p1 == (unsigned char) *p2, then those are still not guaranteed, so you do need to make sure that if you copy from an array of char to an array of unsigned char, you don't include such a conversion.
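One way to copy between the two array types without any per-element value conversion is to copy the bytes directly. This is a sketch of one possible reading of the advice above, not the only approach; copy_bytes is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies `n` bytes from a char buffer into an
// unsigned char buffer without converting individual element values.
void copy_bytes(unsigned char* dst, const char* src, std::size_t n) {
    std::memcpy(dst, src, n);  // preserves the object representation of each byte
}
```

Because std::memcpy works on the object representation, the destination holds the same bytes as the source whether char is signed or unsigned.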
