12

When storing "byte arrays" (blobs, etc.), is it better to use char or unsigned char for the elements (unsigned char being what uint8_t usually aliases)? (The standard says that sizeof is exactly 1 for both.)

Does it matter at all? Or is one more convenient or prevalent than the other? For instance, what do libraries like Boost use?

Emil Laine
Cartesius00

4 Answers

13

If char is signed, then performing arithmetic on a byte value with the high bit set will result in sign extension when promoting to int; so, for example:

char c = '\xf0';
int res = (c << 24) | (c << 16) | (c << 8) | c;

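will typically give 0xfffffff0 instead of 0xf0f0f0f0, because c is promoted to the negative int -16 before each shift. This can be avoided by masking each operand with 0xff before shifting, or by using unsigned char in the first place.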
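As a rough sketch of the unsigned alternative (the helper names replicate_byte and replicate_from_char are made up for illustration, and an 8-bit byte is assumed), the byte can be widened to an unsigned 32-bit type before any shifting, so no sign extension can occur:

```cpp
#include <cstdint>

// Hypothetical helper: copy one byte into all four bytes of a 32-bit word.
// The byte is widened to std::uint32_t first, so every shift happens in
// unsigned arithmetic and no sign extension is possible.
constexpr std::uint32_t replicate_byte(unsigned char b) {
    return (static_cast<std::uint32_t>(b) << 24) |
           (static_cast<std::uint32_t>(b) << 16) |
           (static_cast<std::uint32_t>(b) << 8)  |
            static_cast<std::uint32_t>(b);
}

// Hypothetical helper for data that arrives as plain char (which may be
// signed): convert to unsigned char before doing any arithmetic on it.
constexpr std::uint32_t replicate_from_char(char c) {
    return replicate_byte(static_cast<unsigned char>(c));
}
```

With these, replicate_from_char('\xf0') yields 0xf0f0f0f0 whether or not char is signed on the platform.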

char may still be preferable if you're interfacing with libraries that use it instead of unsigned char.

Note that a cast from char * to/from unsigned char * is always safe (3.9p2). A philosophical reason to favour unsigned char is that 3.9p4 in the standard favours it, at least for representing byte arrays that could hold memory representations of objects:

The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
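To make the quoted wording concrete, here is a minimal sketch that copies an object's representation into an array of sizeof(T) unsigned char elements and back. The names object_bytes and from_bytes are hypothetical, and the sketch only applies to trivially copyable types:

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper: copy the object representation of `value`
// into an array of sizeof(T) unsigned char elements.
template <typename T>
std::array<unsigned char, sizeof(T)> object_bytes(const T& value) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be copied byte-wise");
    std::array<unsigned char, sizeof(T)> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(T));
    return bytes;
}

// Hypothetical inverse: rebuild a T from its stored object representation.
template <typename T>
T from_bytes(const std::array<unsigned char, sizeof(T)>& bytes) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "only trivially copyable types can be copied byte-wise");
    T value;
    std::memcpy(&value, bytes.data(), sizeof(T));
    return value;
}
```

A round trip such as from_bytes<double>(object_bytes(3.5)) gives back the original value; the order of the individual bytes in the array depends on the platform.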

ecatmur
2

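In theory, the size of a byte in C++ depends on the compiler settings and target platform, but it is guaranteed to be at least 8 bits. Since uint8_t must be exactly 8 bits wide and no object can be smaller than one byte, uint8_t can only exist where a byte is exactly 8 bits, which is why sizeof(uint8_t) is 1 wherever the type is provided.

Here's more precisely what the standard has to say about it:

§1.7/1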

The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (2.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.

So, if you are working on some special hardware where bytes are not 8 bits, it may make a practical difference. Otherwise, I'd say that it's a matter of taste and what information you want to communicate via the choice of type.
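If the code does rely on 8-bit bytes, one way to state that assumption explicitly is a compile-time check. This is only a sketch: the CHAR_BIT == 8 assertion is a project-level assumption rather than something the standard promises, and bits_per_byte is a hypothetical name:

```cpp
#include <climits>
#include <limits>

// sizeof counts in bytes, so these hold on every conforming implementation.
static_assert(sizeof(char) == 1 && sizeof(unsigned char) == 1,
              "sizeof(char) is 1 by definition");
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Assumption made by this sketch, NOT guaranteed by the standard:
// compilation stops here on a platform whose bytes are wider than 8 bits.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper: number of value bits in one unsigned char.
constexpr int bits_per_byte() {
    return std::numeric_limits<unsigned char>::digits;
}
```

On exotic hardware the third assertion fails at compile time, which is usually preferable to a blob format that silently changes its layout.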

Agentlien
1

One of the other problems with potentially using a signed value for blobs is that the value will depend on the sign representation, which the standard did not fix before C++20. So it's easier to invoke undefined behavior.

For example...

signed char x = 0x80;
int y = 0xffff00ff;

y |= (x << 8); // UB before C++20: x is promoted to a negative int, then left-shifted

The actual arithmetic value would also strictly depend on two's complement, which may surprise some people. Using unsigned explicitly avoids these problems.
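For comparison, a sketch of the same operation done with unsigned types only (or_shifted_byte is a hypothetical name); every step here is well defined regardless of the sign representation:

```cpp
#include <cstdint>

// Hypothetical helper: OR `byte` into bits 8..15 of `word`.
// The byte is widened to std::uint32_t before the shift, so the shifted
// value is never negative and the shift is well defined.
constexpr std::uint32_t or_shifted_byte(std::uint32_t word, unsigned char byte) {
    return word | (static_cast<std::uint32_t>(byte) << 8);
}
```

For instance, or_shifted_byte(0, 0x80) is 0x8000, with none of the upper bits set by sign extension.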

Jason
0

It makes no practical difference, although from a readability point of view it may be clearer if the type is unsigned char, implying values 0..255.

AndersK