There are several posts on the internet that suggest that you should use std::vector<unsigned char>
or something similar for binary data.
But I'd much prefer a std::basic_string
variant, since it provides many convenient string manipulation functions. And as far as I know, since C++11 the standard guarantees what every known C++03 implementation already did: that std::basic_string
stores its contents contiguously in memory.
At first glance then, std::basic_string<unsigned char>
might be a good choice.
I don't want to use std::basic_string<unsigned char>
, however, because almost all operating system functions only accept char*
, making an explicit cast necessary. Also, string literals are const char*
, so I would need an explicit cast to const unsigned char*
every time I assigned a string literal to my binary string, which I would likewise rather avoid. Similarly, functions for reading from and writing to files or network buffers accept char*
and const char*
pointers.
This leaves std::string
, which is basically a typedef for std::basic_string<char>
.
The only potential remaining issue (that I can see) with using std::string
for binary data is that std::string
uses char
(which can be signed).
char
, signed char
, and unsigned char
are three different types and char
can be either unsigned or signed.
So, when an actual byte value of 11111111b
is returned from std::string::operator[]
as a char and you want to check its value, that value can be either 255
(if char
is unsigned) or "something negative" (if char
is signed, the exact value depending on your number representation).
Similarly, if you want to explicitly append the actual byte value 11111111b
to a std::string
, simply appending (char) (255)
might be implementation-defined (or even raise a signal) if char
is signed and the int
to char
conversion results in an overflow.
So, is there a safe way around this, that makes std::string
binary-safe again?
§3.10/15 states:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
- [...]
- a type that is the signed or unsigned type corresponding to the dynamic type of the object,
- [...]
- a char or unsigned char type.
Which, if I understand it correctly, seems to allow using an unsigned char*
pointer to access and manipulate the contents of a std::string
and makes this well-defined. It simply reinterprets the bit pattern as an unsigned char
, without any change or loss of information; the latter because all bits in a char
, signed char
, and unsigned char
must participate in the value representation.
I could then use this unsigned char*
interpretation of the contents of std::string
as a means to access and change the byte values in the [0, 255]
range, in a well-defined and portable manner, regardless of the signedness of char
itself.
This should solve any problems arising from a potentially signed char
.
Are my assumptions and conclusions correct?
Also, is the unsigned char*
interpretation of the same bit pattern (i.e. 11111111b
or 10101010b
) guaranteed to be the same on all implementations? Put differently, does the standard guarantee that "looking through the eyes of an unsigned char
", the same bit pattern always leads to the same numerical value (assuming the number of bits in a byte is the same)?
Can I thus safely (that is, without any undefined or implementation-defined behavior) use std::string
for storing and manipulating binary data in C++11?