33

Which is a better c++ container for holding and accessing binary data?

std::vector<unsigned char>

or

std::string

Is one more efficient than the other?
Is one a more 'correct' usage?

kalaxy
  • Have a look at this post about using char vs unsigned char for binary data: http://stackoverflow.com/questions/277655/why-do-c-streams-use-char-instead-of-unsigned-char – Fernando N. Oct 13 '09 at 21:29
  • For an example of when std::string is used for binary data, see [Google Protobuf](https://protobuf.dev/programming-guides/proto3/#scalar) – Alex Che May 22 '23 at 15:49

9 Answers

32

You should prefer std::vector over std::string. In common cases both solutions can be almost equivalent, but std::strings are designed specifically for strings and string manipulation and that is not your intended use.

David Rodríguez - dribeas
  • "Say that the default character traits determine that 'a' and 'á' are equivalent" That is a bad assumption. See the answer I wrote as continuation to this comment. – Fernando N. Oct 13 '09 at 07:59
  • I rechecked, and you are right in that the standard does define the specialization `char_traits<char>`, and with the standard specialization, assignment, comparisons and ordering are defined as the equivalent for the built-in char type. – David Rodríguez - dribeas Oct 13 '09 at 08:25
  • So with default char_traits std::string would compare no differently than the corresponding std::vector? – kalaxy Oct 13 '09 at 22:57
  • @kalaxy: correct. Anyway, each class was meant for a purpose, and `std::vector` better suits what you want from a buffer, so if only because the intention is clearer (as fnieto points out in his answer) I would prefer `std::vector` – David Rodríguez - dribeas Oct 14 '09 at 06:20
  • @DavidRodríguez-dribeas: I edited your answer, since I understand (from the comments) that the previous version was incorrect. – user541686 Jul 06 '12 at 08:44
15

Both are correct and equally efficient. The only reason to use one of them instead of a plain array is to ease memory management and passing the data as an argument.

I use vector because the intention is more clear than with string.

Edit: C++03 standard does not guarantee std::basic_string memory contiguity. However from a practical viewpoint, there are no commercial non-contiguous implementations. C++0x is set to standardize that fact.
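
To make the equivalence concrete, here is a minimal sketch, assuming a C++11 or later compiler (where std::string storage is guaranteed contiguous). The helper names to_vector, to_string_buffer and same_bytes are made up for illustration. It shows that both containers keep embedded zero bytes and expose a pointer to contiguous storage:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copy a raw buffer into a vector of bytes.
std::vector<unsigned char> to_vector(const unsigned char* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    return std::vector<unsigned char>(p, p + n);
}

// Hypothetical helper: copy the same buffer into a std::string.
// The (pointer, length) constructor does not stop at a zero byte.
std::string to_string_buffer(const unsigned char* p, std::size_t n) {
    assert(p != nullptr || n == 0);
    return std::string(reinterpret_cast<const char*>(p), n);
}

// True when both containers hold byte-for-byte identical contents.
bool same_bytes(const std::vector<unsigned char>& v, const std::string& s) {
    if (v.size() != s.size()) return false;
    if (v.empty()) return true;
    // &v[0] and &s[0] both point at contiguous storage
    // (guaranteed for std::string only since C++11).
    return std::memcmp(&v[0], &s[0], v.size()) == 0;
}
```

Either container can then be passed to a C API as a (pointer, size) pair.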

Fernando N.
  • 1
    From SGI: "The basic_string class represents a Sequence of characters. It contains all the usual operations of a Sequence, and, additionally, it contains standard string operations such as search and concatenation.". Why is that incorrect? I agree it is not the best approach (as I state in my answer) but it is not incorrect. – Fernando N. Oct 13 '09 at 08:47
  • So string works just as well as the vector because it in a sense extends the functionality of a vector yet the only functionality I will need ([] or the like) is contained in both? (Yes I realize that string doesn't actually inherit from vector.) – kalaxy Oct 13 '09 at 19:26
  • 1
    Yes, but conceptually it is a worse option, and it has methods that make no sense for a buffer. If you only want memory management and operator[], why use a class as complex as std::string? – Fernando N. Oct 13 '09 at 21:24
3

Is one more efficient than the other?

This is the wrong question.

Is one a more 'correct' usage?

This is the correct question.
It depends. How is the data being used? If you are going to use the data in a string-like fashion then you should opt for std::string, as using a std::vector may confuse subsequent maintainers. If, on the other hand, most of the data manipulation looks like plain maths or is vector-like, then a std::vector is more appropriate.

Martin York
2

For the longest time I agreed with most answers here. However, just today it hit me why it might actually be wiser to use std::string over std::vector<unsigned char>.

As most agree, using either one will work just fine. But often times, file data can actually be in text format (more common now with XML having become mainstream). This makes it easy to view in the debugger when it becomes pertinent (and these debuggers will often let you navigate the bytes of the string anyway). But more importantly, many existing functions that can be used on a string, could easily be used on file/binary data. I've found myself writing multiple functions to handle both strings and byte arrays, and realized how pointless it all was.
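
As an illustration of reusing string functions on binary data, here is a small sketch (the function names are hypothetical) that looks for a byte marker with std::string::find, next to the std::search call the vector version would need:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Offset of the first occurrence of `marker` in a string buffer,
// or std::string::npos when it is absent. Zero bytes are searched like any other.
std::size_t find_in_string(const std::string& data, const std::string& marker) {
    return data.find(marker);
}

// The same search on a vector of bytes, written with std::search.
// Precondition (an assumption of this sketch): `marker` is not empty.
std::size_t find_in_vector(const std::vector<unsigned char>& data,
                           const std::vector<unsigned char>& marker) {
    assert(!marker.empty());
    auto it = std::search(data.begin(), data.end(), marker.begin(), marker.end());
    return it == data.end() ? std::string::npos
                            : static_cast<std::size_t>(it - data.begin());
}
```

Both work; the string version is simply shorter because the search is already a member function.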

Mike Weir
1

This is a comment on dribeas's answer. I am writing it as an answer so that I can format the code.

This is the char_traits compare function, and the behaviour is quite healthy:

static bool
lt(const char_type& __c1, const char_type& __c2)
{ return __c1 < __c2; }

template<typename _CharT>
int
char_traits<_CharT>::
compare(const char_type* __s1, const char_type* __s2, std::size_t __n)
{
  for (std::size_t __i = 0; __i < __n; ++__i)
    if (lt(__s1[__i], __s2[__i]))
      return -1;
    else if (lt(__s2[__i], __s1[__i]))
      return 1;
  return 0;
}
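
A short sketch of what this means for binary data, using only the standard std::char_traits<char>::compare (the wrapper name compare_bytes is hypothetical): the comparison runs over the full length it is given and does not stop at a zero byte the way strcmp would.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Three-way compare of two equal-length byte buffers via the default traits.
// Returns a negative value, zero, or a positive value.
int compare_bytes(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    return std::char_traits<char>::compare(a.data(), b.data(), a.size());
}
```

For example, the three-byte buffers {'a', 0, 'b'} and {'a', 0, 'c'} compare as different, even though they are equal up to the embedded zero.
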
Fernando N.
  • Is this behavior well defined in the standard? – gnud Oct 13 '09 at 08:03
  • +1: @gnud: Not in general, but fnieto is right (I just checked it) in that the standard defines the specialization of traits for char, where `assign`, `eq` and `lt` must be defined as builtin operators =, == and < for type `char`. – David Rodríguez - dribeas Oct 13 '09 at 08:22
0

As far as readability is concerned, I prefer std::vector. std::vector should be the default container in this case: the intent is clearer and as was already said by other answers, on most implementations, it is also more efficient.

On one occasion I did prefer std::string over std::vector though. Let's look at the signatures of their move constructors in C++11:

vector (vector&& x);

string (string&& str) noexcept;

On that occasion I really needed a noexcept move constructor. std::string provides it and std::vector does not.
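
Whether a given toolchain provides the guarantee can be checked at compile time. The sketch below uses std::is_nothrow_move_constructible; the helper nothrow_movable is a made-up name. As far as I know, C++17 added noexcept to the std::vector move constructor and common standard libraries already declared it that way earlier, so the vector result may well be true on your compiler even in C++11 mode; that part is an assumption to verify, not something the C++11 text promises.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <vector>

// Compile-time query: does move-constructing a T promise not to throw?
template <typename T>
constexpr bool nothrow_movable() {
    return std::is_nothrow_move_constructible<T>::value;
}

// Guaranteed by the C++11 signature quoted above.
static_assert(nothrow_movable<std::string>(),
              "std::string move constructor should be noexcept");

// Not asserted here: depends on the language version and library in use.
constexpr bool vector_is_nothrow_movable =
    nothrow_movable<std::vector<unsigned char> >();
```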

Arnaud
-1

If you just want to store your binary data, you can use bitset which optimizes for space allocation. Otherwise go for vector, as it's more appropriate for your usage.

Jacob
  • 2
    bitset is not a good choice. How are you going to get the data back out without casting? How do you easily read a byte out of a bitset? This isn't the right application for bitset. – Brian Neal Oct 12 '09 at 20:04
  • Hence, "if you just want to store your binary data". This is important in some memory intensive processes - for e.g. when working with binary images, you'd want to store them temporarily and then reuse them later. – Jacob Oct 12 '09 at 20:26
  • How often do you actually "just store data" though? If I was going to store it I would use a file or just an array or vector. What advantages does bitset have for storage? How do you even get your binary data into a bitset? Bitset has really lousy constructors for that purpose. Have you actually tried to do this? Bitset has a default constructor, a constructor that takes an unsigned long, and one that takes a string. Not real convenient for this purpose. – Brian Neal Oct 12 '09 at 23:18
  • Storing it in an array or a vector would defeat the purpose of storage since we're using bitset for its optimized allocation of *bits*. Passing a string of bits is not that difficult. As for applications, binary images are one: an RGB 1024x768 is 2.25MB stored as uchars - imagine storing a small batch of frames (which is **not** unrealistic). Also, r/w to files is much slower than storing it temporarily as a bitset. Additionally, I did mention that if storage wasn't the prime motivation, `vector` is better. – Jacob Oct 13 '09 at 00:15
  • Bitset is not optimized for storage of bits. In fact, the standard makes no guarantees on how the bits are actually stored. Bitset is used when you need, what else, a set of bits, as for example, flag manipulation. Please tell me how you are going to store a binary image 2.25 MB in size in a bit set. There is nothing more optimized for space allocation than an array of unsigned char. – Brian Neal Oct 13 '09 at 00:42
  • Read the line about optimizing space allocation: http://www.cplusplus.com/reference/stl/bitset/ – Jacob Oct 13 '09 at 02:53
  • Jacob, this is silly. You claim that bitset is useful for storing binary data. This is absurd. Bitset is not a container, and it has no suitable constructors for being initialized from raw data, unlike vector or string. Are you seriously telling me you would construct a string of ASCII 1's and 0's from 2.25 MB of binary data in order to construct a bitset??? That's a pretty big string. Think about it. Bitset was not meant for this purpose. The C++ standard does not even specify how bitset internally stores data, unlike vector, which the standard guarantees to be contiguous. – Brian Neal Oct 13 '09 at 03:01
  • There is no more compact way to store data in memory in C++ than with an array of unsigned char. The standard guarantees that you can treat the memory inside of vector as a contiguous array. You cannot (portably) do that with bitset. You can't (portably) memcpy raw data into a bitset either. – Brian Neal Oct 13 '09 at 03:11
  • `bitset` is efficient at storing binary data - I never said `bitset` was an STL container. And creating that "pretty big string" (which would use `unsigned char`, btw) is trivial. Also, everything I've seen till now (sample code on my compiler, Googling and Effective STL (pg.70)) indicates that bitset **does** store binary data effectively. And yes, there *is* a better way to store binary data, and it's `bitset` - have you tried it out on your compiler? It's only two lines of code. – Jacob Oct 13 '09 at 03:56
  • 1
    To initialize a 2.25 MB bitset, you need a 10 MB string; each *character* in the string represents just *one bit* in the bitset. Also, you need to know how many bits you'll need *at compile time*. There are just two ways of extracting a bitset's contents en masse: to_ulong is useless if you have more bits than fit in a long, and to_string returns a string of zeroes and ones that can't easily be used in any other data type. So, yes, if all you want to do is *store* a preset amount of data, bitset might be OK. If you want the data back, or if the size is uncertain, then it's a lousy choice. – Rob Kennedy Oct 13 '09 at 07:07
  • Agreed, if the size is uncertain, it's lousy, but getting the data back is `not` since it's the same as storing the data, you can use `bitset::to_string`. And yes, you need a 10 MB string - that's the whole point of using bitset. Suppose you have an array of bits which you've obtained as unsigned chars after some logical operation perhaps, and it's 10MB and you want to store it in memory - what do you do? `bitset`! – Jacob Oct 13 '09 at 11:50
  • Ha-ha, you keep messing with your 10 MB string and I'll use my 2 MB vector. I still have absolutely no clue why you feel bitset is good for "storing" data. Why is it better than vector? And what the heck are you supposed to do with it while it is in bitset? And yes I have tried to use bitset for binary data. I actually wrote my own implementation of bitset and gave it constructors and accessors to get the raw data in and back out for embedded systems. But I need it because I was using it as it was intended, as a set of bit flags, not storage. – Brian Neal Oct 13 '09 at 23:19
  • The fact that bitset doesn't provide (begin, end) constructors and raw data accessors makes it absolutely terrible for storing data. Your only way in or out for large numbers of bits is string? You also cannot say it is optimized for storage. As I have said several times, the standard does not guarantee how bitset should store data, unlike vector. For all you know, your bitset may store 1 bit in every byte for speed. I know of no implementation that actually does this, but that's why you can't count on it or portably memcpy it around. P.S. Don't rely on cplusplus.com for everything. – Brian Neal Oct 13 '09 at 23:36
  • I don't think you understand what I'm saying. Your 2MB vector which is supposed to represent 2Mbits can be more efficiently stored on *most implementations* (could you point out an implementation which performs so poorly? I can't find one!) using bitset. How? You throw it in to the constructor and poof! you get a bitset which has stored your data by possibly a factor of 8. Also, all I've said, *repeatedly* is, **storage**. Nothing about accessors, etc. etc. – Jacob Oct 14 '09 at 05:49
  • @Jacob: I think you have a communications problem here with Brian. If you read a 1024x768@24 bit raw image you will have 2.25MBytes of information. The most that a bitset can pack the data is one bit for each element, and at that level it will require exactly 2.25MBytes of memory, just as a vector of bytes. Bitset will be an advantage if each of your original elements is a bit (at this point you can note that `std::vector<bool>` is a specialization that is optimized for space, not that the standard committee is happy about it), so at that point it won't even take more memory than a bitset. – David Rodríguez - dribeas Oct 14 '09 at 06:32
  • ... Now, if your intended use is testing flags, using a vector of bytes will be more cumbersome as it will require extracting each byte and then testing each bit for reading, extracting the byte, setting the bit and inserting the result back for setting a bit. At that point using a bitset or vector will simplify user code. But the thing is that if the elements you work with are not bits but rather bytes, then a vector is more efficient cpu wise than a bitset and is not less efficient memory wise. In most cases, when people talk about storing binary data they refer to bytes, not bits. – David Rodríguez - dribeas Oct 14 '09 at 06:35
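
The mechanics argued over in these comments can be sketched as follows, with a 16-bit size picked arbitrarily and hypothetical helper names. The bit count is a template argument fixed at compile time, the string constructor consumes one '0' or '1' character per bit, and to_ulong / to_string are the bulk ways back out, whereas two plain bytes hold the same 16 bits directly:

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The size is part of the type, so it must be known at compile time.
constexpr std::size_t kBits = 16;

// Build a bitset from text: each '0'/'1' character supplies a single bit,
// with the leftmost character becoming the most significant bit.
std::bitset<kBits> bits_from_text(const std::string& text) {
    assert(text.size() == kBits);
    return std::bitset<kBits>(text);
}

// Store the same 16 bits as two raw bytes, most significant byte first.
std::vector<unsigned char> bytes_from_value(unsigned long value) {
    std::vector<unsigned char> out;
    out.push_back(static_cast<unsigned char>((value >> 8) & 0xFF));
    out.push_back(static_cast<unsigned char>(value & 0xFF));
    return out;
}
```

Initializing 16 bits this way takes a 16-character string, eight times the size of the two bytes that carry the same information.
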
-1

Compare the two and choose for yourself which is more suitable for you. Both are very robust and work with STL algorithms ... Choose for yourself which is more effective for your task.

Davit Siradeghyan
-1

Personally I prefer std::string because string::data() is much more intuitive for me when I want my binary buffer back in C-compatible form. I know that vector elements are guaranteed to be stored contiguously, but exercising this in code feels a little bit unsettling.

This is a style decision that an individual developer or a team should make for themselves.

Oleg Zhylin
  • You prefer using a string for non-string data? Rather than using the container *designed* for contiguous storage of data of any type? – jalf Oct 12 '09 at 22:54
  • 2
    Let's not forget that this is a matter of style. Perfectly workable and standard compliant code for binary buffers can be created with either of these classes. I would argue that vector is not designed to be a binary buffer either. It is compatible, but you will have to revert to algorithms or C tricks to get the job done. Not all string operations are safe, but some of them are quite useful to make the code cleaner and more maintainable. – Oleg Zhylin Oct 13 '09 at 00:32
  • Vector is quite suited to store binary data, e.g. vector<unsigned char> v(256). I don't consider &v[0] a "C trick". – Brian Neal Oct 13 '09 at 00:46
  • 1
    No, &v[0] is fine, and so is s.data(). What is vector's alternative for string s; s.assign(BinaryBuffer, BinaryBufferSize); ? – Oleg Zhylin Oct 13 '09 at 00:58
  • vector<unsigned char> v; v.assign(BinaryBuffer, BinaryBuffer + BinaryBufferSize); – Brian Neal Oct 13 '09 at 01:16
  • Of course vector has a constructor explicitly for that purpose too: vector<unsigned char> v(first, last); – Brian Neal Oct 13 '09 at 01:17
  • Thus you have to explicitly parametrize vector with unsigned char and make sure pointer arithmetic works correctly in BinaryBuffer + BinaryBufferSize. Looks like more pitfalls than the string option to me. As I said in the beginning, this is clearly a style issue. There's no such thing as "universal style". Teams or individual developers should decide which option they like better and adhere to that. – Oleg Zhylin Oct 13 '09 at 11:30
  • Um, string is already parameterized by char, did you notice? So typedef your vector if that makes you feel weird. String is meant for strings of characters, not raw binary data. String is a much more heavy-weight solution. – Brian Neal Oct 13 '09 at 23:22
  • And what do you mean by making sure pointer arithmetic works correctly? Vector uses the 2-iterator (begin, end) idiom like the rest of the STL (and string). Hardly more pitfalls than string. – Brian Neal Oct 13 '09 at 23:24
  • Pointer arithmetic may play tricks if BinaryBuffer is not (unsigned char*). Could you please elaborate on what makes string _much_ more heavyweight? – Oleg Zhylin Oct 14 '09 at 00:38
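
The pointer-arithmetic concern raised in this exchange can be sketched like this; load_vector and load_string are hypothetical helpers, and the buffer arrives as a void* to stand in for "not an unsigned char*". Casting to a byte pointer before adding the size keeps `first + size` counting in bytes for the vector, while the string takes a (pointer, length) pair:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: load `size` bytes from a buffer of unknown pointee type.
// The cast to unsigned char* makes `first + size` advance in bytes.
std::vector<unsigned char> load_vector(const void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    const unsigned char* first = static_cast<const unsigned char*>(buffer);
    std::vector<unsigned char> v;
    v.assign(first, first + size);
    return v;
}

// The std::string counterpart: assign(pointer, count) copies exactly `size` bytes.
std::string load_string(const void* buffer, std::size_t size) {
    assert(buffer != nullptr || size == 0);
    std::string s;
    s.assign(static_cast<const char*>(buffer), size);
    return s;
}
```

Had the vector been filled from a wider pointer type, such as an unsigned short*, `pointer + size_in_bytes` would have run past the end of the buffer, which is the pitfall the last comment alludes to.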