74

I want to use a function that expects data like this:

void process(char *data_in, int data_len);

So it's just processing some bytes really.

But I'm more comfortable working with "unsigned char" when it comes to raw bytes (it somehow "feels" more right to deal with positive 0 to 255 values only), so my question is:

Can I always safely pass an unsigned char * into this function?

In other words:

  • Is it guaranteed that I can safely convert (cast) between char and unsigned char at will, without any loss of information?
  • Can I safely convert (cast) between pointers to char and unsigned char at will, without any loss of information?

Bonus: Is the answer the same in C and C++?

user2015453
  • It's safe to use char* to represent bytes, as the standard I/O library does: std::istream& std::istream::read(char* s, streamsize n); std::ostream& std::ostream::write(const char* s, streamsize n); – BruceAdi Mar 07 '13 at 03:58

6 Answers

120

The short answer is yes if you use an explicit cast, but to explain it in detail, there are three aspects to look at:

1) Legality of the conversion
Converting between signed T* and unsigned T* (for some type T) in either direction is generally possible because the source type can first be converted to void * (this is a standard conversion, §4.10), and the void * can be converted to the destination type using an explicit static_cast (§5.2.9/13):

static_cast<unsigned char*>(static_cast<void *>(data_in))

This can be abbreviated (§5.2.10/7) as

reinterpret_cast<unsigned char *>(data_in)

because char is a standard-layout type (§3.9.1/7,8 and §3.9/9) and signedness does not change alignment (§3.9.1/1). It can also be written as a C-style cast:

(unsigned char *)(data_in)

Again, this works both ways, from unsigned* to signed* and back. There is also a guarantee that if you apply this procedure one way and then back, the pointer value (i.e. the address it's pointing to) won't have changed (§5.2.10/7).

All of this applies not only to conversions between signed char * and unsigned char *, but also to char */unsigned char * and char */signed char *, respectively. (char, signed char and unsigned char are formally three distinct types, §3.9.1/1.)

To be clear, it doesn't matter which of the three cast-methods you use, but you must use one. Merely passing the pointer will not work, as the conversion, while legal, is not a standard conversion, so it won't be performed implicitly (the compiler will issue an error if you try).
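As a minimal sketch of the three spellings described above: the helper names below (via_static_cast and so on) are made up for illustration and are not part of any library. Each helper should return the same address; only the pointer type differs.

```cpp
// Hypothetical helpers illustrating the three equivalent casts.
unsigned char* via_static_cast(char* p) {
    // Two-step conversion through void*.
    return static_cast<unsigned char*>(static_cast<void*>(p));
}

unsigned char* via_reinterpret_cast(char* p) {
    return reinterpret_cast<unsigned char*>(p);
}

unsigned char* via_c_style_cast(char* p) {
    return (unsigned char*)(p);
}

// Converting one way and then back yields the original pointer value.
bool round_trip_keeps_address(char* p) {
    unsigned char* u = reinterpret_cast<unsigned char*>(p);
    return reinterpret_cast<char*>(u) == p;
}
```

Writing unsigned char* u = p; without any of these casts is rejected by a C++ compiler, which is the point made in the last paragraph.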

2) Well-definedness of the access to the values
What happens if, inside the function, you dereference the pointer, i.e. you perform *data_in to retrieve a glvalue for the underlying character; is this well-defined and legal? The relevant rule here is the strict-aliasing rule (§3.10/10):

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • [...]
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • [...]
  • a char or unsigned char type.

Therefore, accessing a signed char (or char) through an unsigned char* (or char*), and vice versa, is not disallowed by this rule – you should be able to do this without problems.
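To illustrate the kind of access this rule permits, here is a small sketch. The function name sum_as_unsigned is hypothetical; it reads a char buffer through an unsigned char lvalue, so each byte contributes a value from 0 to UCHAR_MAX.

```cpp
#include <cstddef>

// Sums the bytes of a char buffer, reading each one as an unsigned char.
// Accessing the chars through an unsigned char lvalue is allowed by the aliasing rule.
unsigned long sum_as_unsigned(const char* data, std::size_t len) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(data);
    unsigned long total = 0;
    for (std::size_t i = 0; i < len; ++i) {
        total += bytes[i];
    }
    return total;
}
```

Summing through the original const char* instead could add negative numbers on platforms where char is signed, which is usually not what byte-processing code wants.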

3) Resulting values
After dereferencing the type-converted pointer, will you be able to work with the value you get? It's important to bear in mind that the conversion and dereferencing of the pointer described above amounts to reinterpreting (not changing!) the bit pattern stored at the address of the character. So what happens when a bit pattern for a signed character is interpreted as that of an unsigned character (or vice versa)?

When going from unsigned to signed, the typical effect will be that for values from 0 to 127 nothing happens, and values of 128 and above become negative. Similarly in reverse: when going from signed to unsigned, negative values will appear as values of 128 or greater.

But this behaviour isn't actually guaranteed by the Standard. The only thing the Standard guarantees is that for all three types, char, unsigned char and signed char, all bits (not necessarily 8, btw) are used for the value representation. So if you interpret one as the other, make a few copies and then store it back to the original location, you can be sure that there will be no information loss (as you required), but you won't necessarily know what the values actually mean (at least not in a fully portable way).
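The "no information loss" claim can be sketched as follows. This is only an illustration under one stated assumption: plain char has no trap representations on the platform (true for two's complement machines). The function name is hypothetical. Note that the check deliberately compares the stored bytes only and never relies on the numeric value seen through the char lvalue.

```cpp
#include <climits>

// Stores every unsigned char value, reads it back through a char lvalue,
// copies it, writes the copy through a char*, and compares the stored bytes.
// Assumption: plain char has no trap representations on this platform.
bool all_values_survive_round_trip() {
    for (int v = 0; v <= UCHAR_MAX; ++v) {
        unsigned char original = static_cast<unsigned char>(v);
        unsigned char restored = 0;

        char seen = *reinterpret_cast<char*>(&original);  // reinterpret, no conversion
        char copy = seen;                                 // copy the reinterpreted value
        *reinterpret_cast<char*>(&restored) = copy;       // store it back

        if (restored != original) {
            return false;
        }
    }
    return true;
}
```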

jogojapan
  • That's a great answer and it makes a lot of sense! But you seem to be addressing C++ specifically (which is great), but can you update it to contain also how plain C would differ from C++? i'm particularly interested in knowing if your last paragraph (about the bits and information loss) is also guaranteed to hold for plain C. – user2015453 Mar 02 '13 at 13:57
  • @user2015453 Thanks -- I believe all of this applies to C as well, but it'll take me a little longer to check this. I'll update the answer once I am sure. – jogojapan Mar 02 '13 at 13:59
  • For 1) the argument can be much shorter, at least for C. C guarantees that `void*` and pointers to the `char` types have the same representation. – Jens Gustedt Mar 07 '13 at 09:21
  • For 2) and 3) things are a bit more complicated, at least for C. Neither type's representation can have padding bits, that is correct. But the signed types (`signed char`, and `char` if it is signed) could have a "trap" representation. That would be the bit pattern that corresponds to "negative zero"; it is implementation-defined whether this is a valid value for these types or not. E.g. the constant `SCHAR_MIN` could be just `-127` instead of `-128`. I don't know of any real existing architecture that has this, though. – Jens Gustedt Mar 07 '13 at 09:27
  • @JensGustedt Interesting. If the latter part is true for C, then, I am pretty sure, it is true for C++, too. But that still doesn't mean you'd lose information if you convert between `unsigned char` and `signed char`, correct? It just means you'd get somewhat unexpected values, or did I misunderstand? ... and actually, I am sure the C++ Standard doesn't say anything about _how_ the bits inside these types are to be used, except that it insists that _all_ of them be used. So the trap representation is certainly possible. – jogojapan Mar 07 '13 at 09:31
  • @jogojapan, the problem is that accessing a trap value is to "perform a trap" which is defined as *"interrupt execution of the program such that no further operations are performed"* so in short it is illegal to access such a value. – Jens Gustedt Mar 07 '13 at 09:36
  • @JensGustedt I see. That would indeed be relevant; I'll take this into account when I update the answer. – jogojapan Mar 07 '13 at 09:40
  • Signed integer overflow is still undefined behaviour in C++. So wouldn't interpreting an `unsigned char` as a `char` constitute UB? – juanchopanza Jun 04 '14 at 06:58
  • @juanchopanza That's for arithmetic operations applied to the signed integer, not for type-casting / assignments. (If I am not mistaken, that is.) – jogojapan Jun 04 '14 at 11:52
17

unsigned char or signed char is just interpretation: there is no conversion happening.

Since you are processing bytes, to show intent, it would be better to declare as

void process(unsigned char *data_in, int data_len);

[As noted by an editor: A plain char may be either a signed or an unsigned type. The C and C++ standards explicitly allow either (it is always a separate type from either unsigned char or signed char, but has the same range as one of them)]
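If the signature of process cannot be changed, one possible way to get the same intent-revealing interface is a thin wrapper. This is only a sketch: process_bytes is a hypothetical name, and the body of process below is a stand-in (it reverses the buffer) for whatever the real function does.

```cpp
#include <algorithm>

// Stand-in for the function from the question; this one reverses the buffer in place.
void process(char* data_in, int data_len) {
    std::reverse(data_in, data_in + data_len);
}

// Hypothetical wrapper: callers work with unsigned char, and the cast lives in one place.
inline void process_bytes(unsigned char* data_in, int data_len) {
    process(reinterpret_cast<char*>(data_in), data_len);
}
```

Callers can then keep their buffers as unsigned char throughout, for example unsigned char buf[3] = {1, 200, 255}; process_bytes(buf, 3);.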

Mitch Wheat
7

Yes, you can always convert from char to unsigned char and vice versa without problems. If you run the following code and compare its output with an ASCII table (ref. http://www.asciitable.com/), you can see for yourself how C and C++ handle the conversions; both handle them in exactly the same way:

#include <stdio.h>


int main(void) {
    // converting from char to unsigned char
    char c = 0;
    // sizeof(char) is always 1; with 8-bit bytes a char can store 2^8 = 256 values
    printf("%zu byte(s)\n", sizeof(char));
    for (int i = 0; i < 256; i++) {
        printf("int value: %d - from: %c\tto: %c\n", c, c, (unsigned char) c);
        c++;
    }

    // converting from unsigned char to char
    unsigned char uc = 0;
    printf("\n%zu byte(s)\n", sizeof(unsigned char));
    for (int i = 0; i < 256; i++) {
        printf("int value: %d - from: %c\tto: %c\n", uc, uc, (char) uc);
        uc++;
    }
    return 0;
}

I will not post the output because it has too many lines! In the first half of each section, i.e. for i = 0 to 127, the conversion from char to unsigned char and vice versa works without any modification or loss.

However, for i = 128 to 255 the chars and the unsigned chars cannot simply be cast into each other without getting different outputs, because unsigned char holds values in the interval [0, 255] while char (when it is signed, as on most platforms) holds values in the interval [-128, 127]. Nevertheless, the behaviour in this second half is irrelevant, because in C/C++ you generally only deal with chars/unsigned chars as ASCII characters, which can take only 128 different values; the other 128 values (negative for chars, 128 and above for unsigned chars) are never used.

If you never put a value in a char that doesn't represent a character, and you never put a value in an unsigned char that doesn't represent a character, everything will be OK!

extra: even if you use UTF-8 or other encodings (for special characters) in your strings with C/C++, everything with this kind of casts would be OK, for instance, using UTF-8 encoding (ref. http://lwp.interglacial.com/appf_01.htm):

char hearts[]   = {(char)0xe2, (char)0x99, (char)0xa5, 0x00};
char diamonds[] = {(char)0xe2, (char)0x99, (char)0xa6, 0x00};
char clubs[]    = {(char)0xe2, (char)0x99, (char)0xa3, 0x00};
char spades[]   = {(char)0xe2, (char)0x99, (char)0xa0, 0x00};
printf("hearts (%s)\ndiamonds (%s)\nclubs (%s)\nspades (%s)\n\n", hearts, diamonds, clubs, spades);

the output of that code will be:
hearts (♥)
diamonds (♦)
clubs (♣)
spades (♠)

even if you cast each of its chars to unsigned chars.

so:

  • "can I always safely pass an unsigned char * into this function?" yes!

  • "is it guaranteed that I can safely convert (cast) between char and unsigned char at will, without any loss of information?" yes!

  • "can I safely convert (cast) between pointers to char and unsigned char at will, without any loss of information?" yes!

  • "is the answer same in C and C++?" yes!

sissi_luaty
3

Semantically, passing between unsigned char * and char * is safe, even when that involves casting between them, and the same holds in C++.

However, consider the following sample code:

#include <stdio.h>

void process_unsigned(unsigned char *data_in, int data_len) {
    int i=data_len;
    unsigned short product=1;

    for(; i--; product*=data_in[i]) 
        ;

    for(i=sizeof(product); i--; ) {
        data_in[i]=((unsigned char *)&product)[i];
        printf("%d\r\n", data_in[i]);
    }
}

void process(char *data_in, int data_len) {
    int i=data_len;
    unsigned short product=1;

    for(; i--; product*=data_in[i]) 
        ;

    for(i=sizeof(product); i--; ) {
        data_in[i]=((unsigned char *)&product)[i];
        printf("%d\r\n", data_in[i]);
    }
}

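int main(void) {
    /* 255 is the value that -1 converts to in an 8-bit unsigned char */
    unsigned char
        a[]={1, 255},
        b[]={1, 255};

    process_unsigned(a, sizeof(a));
    process((char *)b, sizeof(b));  /* explicit cast: the pointer types differ */
    return 0;
}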

output:

0
255
-1
-1

The code inside process_unsigned and process is IDENTICAL. The only difference is unsigned versus signed. This sample shows that the code in the black box is affected by the sign, and that nothing is guaranteed between the callee and the caller.

So I would say that only the passing itself is safe; nothing beyond that is guaranteed.

Ken Kin
2

You can pass a pointer to a different kind of char, but you may need to explicitly cast it. The pointers are guaranteed to be the same size and the same values. There isn't going to be any information loss during the conversion.

If you want to convert char to unsigned char inside the function, you just assign a char value to an unsigned char variable or cast the char value to unsigned char.

If you need to convert unsigned char to char without data loss, it's a bit harder, but still possible:

#include <limits.h>

char uc2c(unsigned char c)
{
#if CHAR_MIN == 0
  // char is unsigned
  return c;
#else
  // char is signed
  if (c <= CHAR_MAX)
    return c;
  else
    // ASSUMPTION 1: int is larger than char
    // ASSUMPTION 2: integers are 2's complement
    return c - CHAR_MAX - 1 - CHAR_MAX - 1;
#endif
}

This function will convert unsigned char to char in such a way that the returned value can be converted back to the same unsigned char value as the parameter.
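The round-trip property claimed here can be checked exhaustively. The sketch below repeats uc2c so that it is self-contained and keeps the same two assumptions as the original (int is larger than char, integers are 2's complement); uc2c_round_trips is a hypothetical helper name.

```cpp
#include <climits>

// uc2c as given above (same assumptions: int larger than char, 2's complement).
char uc2c(unsigned char c) {
#if CHAR_MIN == 0
    return c;                                  // char is unsigned
#else
    if (c <= CHAR_MAX)
        return c;                              // value fits as-is
    else
        return c - CHAR_MAX - 1 - CHAR_MAX - 1;  // map the upper half to negatives
#endif
}

// Hypothetical check: every value converts to char and back unchanged.
bool uc2c_round_trips() {
    for (int v = 0; v <= UCHAR_MAX; ++v) {
        unsigned char in = static_cast<unsigned char>(v);
        unsigned char out = static_cast<unsigned char>(uc2c(in));
        if (out != in) {
            return false;
        }
    }
    return true;
}
```

The conversion back to unsigned char needs no helper because conversion to an unsigned type is defined as modular arithmetic.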

Alexey Frunze
  • So this "wrapping around" is not done automatically by the language when I assign unsigned char to a char? – user2015453 Feb 26 '13 at 09:27
  • It's only done automatically for unsigned types. Overflows in signed integers result in undefined behavior. The fact that your compiler might be handling signed overflows the same way as unsigned overflows is either luck or a documented feature. Another compiler can totally ruin your code when it sees a possibility for undefined behavior. – Alexey Frunze Feb 26 '13 at 09:31
  • @AlexeyFrunze: Consider if `UCHAR_MAX == 255`, `CHAR_MIN == -127` and `CHAR_MAX == 127`. How many distinct values can an `unsigned char` represent? How many distinct values can a `char` represent? Undefined behaviour might occur in your code when `c == CHAR_MAX + 1`, because there might not be a signed value that converts to it. I suggest: `if (c <= CHAR_MAX) { return c; } else if (c < (unsigned char) CHAR_MIN) { /* negative zero */ return 0; } else { return -(UCHAR_MAX - c + 1); }` – autistic Mar 02 '13 at 02:39
  • When an overflow occurs during the conversion from unsigned to signed, the result is implementation-defined (§4.7/3), but it's not UB. (UB occurs when arithmetic operations performed on a signed type result in overflows.) – jogojapan Mar 02 '13 at 06:33
  • @modifiablelvalue I willingly neglect the possibility of having non-2's-complement representations and symmetric 2's complement representations. That is why I put there a comment. I know what you're talking about. – Alexey Frunze Mar 02 '13 at 08:49
  • @jogojapan Yes, you're right, it's either implementation-defined value or implementation-defined signal (at least in C99). I'm always forgetting about this subtle distinction. – Alexey Frunze Mar 02 '13 at 08:54
  • @AlexeyFrunze Why not willingly neglect the possibility that an implementation might define the implementation-defined behaviour of unsigned-to-signed-negative conversion to function differently to your code, and the undefined behaviour of signed integer overflow to do otherwise than silently wrapping, since *most implementations function this way*? – autistic Mar 03 '13 at 04:21
  • @modifiablelvalue What's your point? Reducing things ad absurdum or is there something more practical in your mind? – Alexey Frunze Mar 03 '13 at 10:19
  • @jogojapan: This implementation-defined behaviour may become undefined behaviour as a result of 7.14.1.1p3, if the implementation-defined signal corresponds to a computational exception and the default handler returns. – autistic Mar 08 '13 at 02:02
  • @modifiablelvalue Yes, like any implementation-defined behaviour, you need to handle it correctly. But that doesn't justify calling the operation of initializing a signed integer with an unsigned one "undefined behaviour". It only means you need to do it in the right way. (By the way, _is_ there an implementation that really raises a signal? GCC does not: http://gcc.gnu.org/onlinedocs/gcc/Integers-implementation.html) – jogojapan Mar 08 '13 at 03:01
  • @jogojapan Will there ever be such an implementation? Since C says it can happen, it just might. When C was initially developed, it was with the idea that most software should be *portable*. Hence, it makes guarantees regarding portability that are useful on any conforming implementation at the present and well into the future. This particular implementation-defined behaviour is not one of those guarantees. We've seen compilers have their ten minutes of fame, so anything said about a specific implementation is irrelevant to this discussion... and this discussion is irrelevant to this comment. – autistic Mar 08 '13 at 05:12
1

You really need to look at the code of process() to know whether you can safely pass in unsigned characters. If the function uses the characters as an index into an array, then no, you can't use unsigned data, because bytes of 128 and above reach a signed char parameter as negative values.
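A sketch of the array-index hazard, and of one common way a function can guard against it. The table byte_weight and the function weight_of are hypothetical names invented for this illustration; nothing here is taken from the actual process().

```cpp
#include <climits>

// Hypothetical lookup table with one entry per possible byte value.
static int byte_weight[UCHAR_MAX + 1];

// Risky when char is signed: bytes of 128 and above become negative indices.
//   int weight = byte_weight[c];
// Safer: convert to unsigned char first so the index is always in range.
int weight_of(char c) {
    return byte_weight[static_cast<unsigned char>(c)];
}
```

If process() indexes with the plain char directly, as in the commented-out line, then data containing bytes of 128 and above would read outside the table on platforms where char is signed.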

Sean Conner