There are several posts on the internet that suggest that you should use std::vector<unsigned char>
or something similar for binary data.
But I'd much prefer a std::basic_string
variant, since it provides many convenient string manipulation functions. And as far as I know, since C++11 the standard guarantees what every known C++03 implementation already did: that std::basic_string
stores its contents contiguously in memory.
At first glance then, std::basic_string<unsigned char>
might be a good choice.
I don't want to use std::basic_string<unsigned char>
, however, because almost all operating system functions only accept char*
, making an explicit cast necessary. Also, string literals are const char*
, so I would need an explicit cast to const unsigned char*
every time I assigned a string literal to my binary string, which I would likewise rather avoid. Similarly, functions for reading from and writing to files or network buffers accept char*
and const char*
pointers.
This leaves std::string
, which is basically a typedef for std::basic_string<char>
.
The only potential remaining issue (that I can see) with using std::string
for binary data is that std::string
uses char
(which can be signed).
char
, signed char
, and unsigned char
are three different types and char
can be either unsigned or signed.
So, when an actual byte value of 11111111b
is returned from std::string::operator[]
as a char and you want to check its value, that value can be either 255
(if char
is unsigned) or "something negative" (if char
is signed, the exact value depending on your number representation).
Similarly, if you want to explicitly append the actual byte value 11111111b
to a std::string
, simply appending (char) (255)
might be implementation-defined (or even raise a signal) if char
is signed and the int
to char
conversion results in an overflow.
So, is there a safe way around this, that makes std::string
binary-safe again?
§3.10/15 states:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
- [...]
- a type that is the signed or unsigned type corresponding to the dynamic type of the object,
- [...]
- a char or unsigned char type.
Which, if I understand it correctly, seems to allow using an unsigned char*
pointer to access and manipulate the contents of a std::string
and makes this well-defined. It simply reinterprets the bit pattern as an unsigned char
, without any change or loss of information; the latter because all bits in a char
, signed char
, and unsigned char
must participate in the value representation.
I could then use this unsigned char*
interpretation of the contents of std::string
as a means to access and change the byte values in the [0, 255]
range, in a well-defined and portable manner, regardless of the signedness of char
itself.
This should solve any problems arising from a potentially signed char
.
Are my assumptions and conclusions correct?
Also, is the unsigned char*
interpretation of the same bit pattern (i.e. 11111111b
or 10101010b
) guaranteed to be the same on all implementations? Put differently, does the standard guarantee that "looking through the eyes of an unsigned char
", the same bit pattern always leads to the same numerical value (assuming the number of bits in a byte is the same)?
Can I thus safely (that is, without any undefined or implementation-defined behavior) use std::string
for storing and manipulating binary data in C++11?