Is it inappropriate to use a std::string to store binary data?

Question

I was surprised to see in this question that someone modified a working snippet just because, as the author of the second answer says:

it didn't seem appropriate to me that I should work with binary data stored within std::string object

Is there a reason why I should not do so?

Is there a reason you *would* do it, instead of using, say, `std::vector`? — juanchopanza, May 05 '14 at 14:43
In that very case, yes: the snippet is working, why change it? — qdii, May 05 '14 at 14:44
Because `std::string` is allowed to do copy-on-write in C++03, and adds a `\0` at the end of the data block? It is not designed for storing a block of arbitrary binary data. It is designed to implement the concept of a character string. — juanchopanza, May 05 '14 at 14:46
@juanchopanza COW (used by few implementations, since it has proven to be mostly bad) is no longer allowed in C++11, and how does the added zero-terminator hinder me? The only good point is not using string when there's no real text. — Deduplicator, May 05 '14 at 14:56
@Deduplicator Because if you want to store binary data, you usually want to be fully in control of what you store. Why do you want something that adds an extra `\0` at the end? That makes no sense. — juanchopanza, May 05 '14 at 16:14
@juanchopanza: You are in full control. So there is a 0 stored after your data (not included in the count), how does that hinder you? There could be garbage instead, who cares? — Deduplicator, May 05 '14 at 16:21
@Deduplicator it hinders you if you don't want to have extra data appended when you don't actually need it. How can that be so hard to understand? — juanchopanza, May 05 '14 at 16:54
@juanchopanza let's say I am using C++03, how is copy-on-write a reason why I should perform a `string` -> `vector` change? — qdii, May 05 '14 at 17:02
@qdii For example, it wouldn't be safe to use `&s[0]` to access the underlying data block in cases where it would be OK with an `std::vector`. — juanchopanza, May 05 '14 at 17:08
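To make the `&s[0]` point from the comments concrete, here is a minimal sketch of writing into each container through a raw pointer, as one would before handing the buffer to a C API. The helper names `fill_vector` and `fill_string` are hypothetical and not from the thread. The `std::vector` version relies on a contiguity guarantee that already exists in C++03; the `std::string` version relies on one that only exists from C++11 onward.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: overwrite every byte of a vector through a raw pointer.
// std::vector storage is guaranteed contiguous in C++03 and later.
void fill_vector(std::vector<unsigned char>& buf, unsigned char value) {
    if (!buf.empty())
        std::memset(&buf[0], value, buf.size());
}

// Hypothetical helper: the same operation on a std::string.
// Treating &buf[0] as a contiguous block is only guaranteed from C++11 on;
// a C++03 implementation (for example a copy-on-write one) gives no such promise.
void fill_string(std::string& buf, char value) {
    if (buf.empty())
        return;
    // Sanity check of the contiguity assumption this function depends on.
    assert(&buf[0] + (buf.size() - 1) == &buf[buf.size() - 1]);
    std::memset(&buf[0], value, buf.size());
}
```

Both helpers leave the size untouched, so the caller is expected to size the container first, for example with `std::vector<unsigned char> v(4);` or `std::string s(4, '\0');`.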

6502 · Accepted Answer · 2014-05-05T20:24:54.693

7

For binary data, in my opinion, the best option is `std::vector<unsigned char>`.

Using `std::string` technically works, but it sends the user the wrong message: that the data being handled is text.
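As a rough illustration of the two choices, the sketch below stores the same payload in both containers. The helper names `make_bytes` and `make_string` are hypothetical, introduced only for this example. Both containers keep embedded zero bytes when built from a pointer and a length; the difference is in what the type tells the reader.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy a raw buffer into a container whose type says "bytes".
std::vector<unsigned char> make_bytes(const unsigned char* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    return std::vector<unsigned char>(data, data + n);
}

// Hypothetical helper: the same payload in a std::string.
// The (pointer, length) constructor preserves embedded zero bytes,
// unlike the constructor that takes only a C string.
std::string make_string(const unsigned char* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    return std::string(reinterpret_cast<const char*>(data), n);
}
```

Given `const unsigned char raw[] = {0x89, 0x00, 0xFF, 0x10};`, both `make_bytes(raw, 4)` and `make_string(raw, 4)` report a size of 4. Note that `std::string` elements are `char`, whose signedness is implementation-defined, so reading a byte back as a number needs a cast to `unsigned char`.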

On the other hand, being able to accept any byte in a string is important, because sometimes you know the content is text but in an unknown encoding. Forcing `std::string` to contain only valid, decoded text would be a big limitation for real-world use.

This kind of limitation is one of the few things I don't like about `QString`: it makes it impossible, for example, to use a file selection dialog to open a file if the filename has a "wrong" (unexpected) encoding or if the encoding is actually invalid (it contains mistakes).
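The filename case can be sketched as follows, assuming a POSIX-like system where file names are opaque byte sequences. The helper names `open_raw_name` and `latin1_example_name` are hypothetical. The example name is "café.txt" encoded as ISO-8859-1, where the lone byte 0xE9 is not valid UTF-8, yet `std::string` carries it unchanged and `std::fopen` receives it as-is.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: open a file whose name is an opaque byte sequence.
// No decoding is attempted; the bytes are passed to the C library unchanged.
// Returns a null pointer if the file cannot be opened.
std::FILE* open_raw_name(const std::string& name) {
    assert(!name.empty());
    return std::fopen(name.c_str(), "rb");
}

// Hypothetical example name: "cafe.txt" with the 'e' replaced by the
// ISO-8859-1 byte for e-acute (0xE9), which is invalid on its own in UTF-8.
std::string latin1_example_name() {
    return std::string("caf\xE9.txt");
}
```

A string type that insisted on decoded text would have to reject or alter this name before it could reach `fopen`. One caveat of this sketch: going through `c_str()` means a name with an embedded zero byte would be cut short, which is acceptable here because file names cannot contain one on such systems.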

edited May 05 '14 at 20:24

answered May 05 '14 at 14:46


Perhaps add that they can only accept a specific superset of proper text because they don't go for UTF-8? – Deduplicator May 05 '14 at 14:51
@Deduplicator: QString is not made of bytes, but of Unicode characters. The problem is that sometimes it is just impossible to go from bytes to Unicode characters because you don't know the encoding. The Linux filesystem is encoding-agnostic, so you can have both ISO-8859 and UTF-8 encoded filenames in the same directory. This is certainly not perfect (whatever you try, you will see strange characters on the screen), but not being able to open a file because you cannot store its name in a string is much worse. – 6502 May 05 '14 at 14:53
By Unicode I think you meant UTF-16. And UTF-8 allows you to ignore that situation by just pretending the input is valid. Well, it's a side-show to a side-issue in an example, so not really important. – Deduplicator May 05 '14 at 14:58
@Deduplicator: I mean Unicode, because QString's limitation is that it only accepts decoded text as content. It doesn't matter if it's 16 or 32 bit. The problem is that sometimes you're provided with bytes that represent text that you cannot decode (because, for example, you don't know the encoding used, or because there are encoding mistakes). For many operations this would be totally irrelevant (e.g. passing those bytes to `fopen` as a filename), and requiring decoded text only creates a usability problem. Writing grep with Qt, for example, would be difficult because regular expressions only work with QString. – 6502 May 05 '14 at 17:12
