Using std::string as a generic uint8_t buffer

Question

I am looking through the Chromium source code to study how it implements the MediaRecorder API, which encodes/records a raw mic input stream to a particular format.

I came across some interesting code in the source. In short:

bool DoEncode(float* data_in, std::string* data_out) {
    ...
    data_out->resize(MAX_DATA_BYTES_OR_SOMETHING);
    opus_encode_float(
        data_in,
        reinterpret_cast<uint8_t*>(base::data(*data_out))
    );
    ...
}

So DoEncode (a C++ method) accepts an array of floats and converts it to an encoded byte stream; the actual work is done in opus_encode_float() (a pure C function).

The interesting part is that the Chromium team used std::string for a byte array instead of std::vector<uint8_t>, and they even manually cast it to a uint8_t buffer.

Why would the Chromium team do this, and is there a scenario where std::string is more useful as a generic byte buffer than alternatives such as std::vector<uint8_t>?

@πάνταῥεῖ I am not looking for a suggestion. The code mentioned is done by Google Chromium team. I am kinda wondering why they code like this. — Bumsik Kim, Nov 25 '18 at 18:47
Really the [only differences](https://stackoverflow.com/questions/6547922/what-are-differences-between-stdstring-and-stdvectorchar) are how the API looks as well as operation optimizations. At best, we could speculate the Chromium team decided they wanted to use string operations on the data more than collection operations. But that's the thing - it's purely speculation. Something else to keep in mind: just because it's something as prolific as Chromium or Google doesn't mean it's perfect. Sometimes, those projects do weird or blatantly incorrect things. — Qix - MONICA WAS MISTREATED, Nov 25 '18 at 19:00
@Qix Thank you for the comment. So there is no reason not to use `std::vector` things after all. — Bumsik Kim, Nov 25 '18 at 19:41

273K · Accepted Answer · 2018-11-26T16:19:36.410

The Chromium coding style (see below) forbids using unsigned integral types without a good reason, and an external API is not such a reason. Both char and unsigned char have size 1, so a char buffer can hold the same bytes.

I looked at the Opus encoder API, and it seems the earlier versions used char rather than unsigned char:

[out]   data    char*: Output payload (at least max_data_bytes long)

Although the API uses unsigned char now, the description still refers to char. So std::string, which stores chars, was more convenient for the earlier API, and the Chromium team didn't change the container after the API was updated: they added a cast on one line instead of updating dozens of other lines.
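To illustrate the point about the cast, here is a minimal sketch of both container choices feeding the same C-style function. `fake_encode`, `kMaxPacketBytes`, `EncodeToString` and `EncodeToVector` are hypothetical names made up for this example; `fake_encode` only stands in for a C encoder and does not reproduce the real Opus signature or Chromium's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a C encoder; NOT the real Opus signature.
// Toy "encoding": one output byte per sample, 1 if the sample is positive.
// Returns the number of bytes written.
extern "C" int fake_encode(const float* in, int samples,
                           unsigned char* out, int max_bytes) {
    int n = samples < max_bytes ? samples : max_bytes;
    for (int i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? 1 : 0;
    return n;
}

constexpr std::size_t kMaxPacketBytes = 4000;  // assumed upper bound

// std::string as the buffer: char storage, so one cast is needed at the call.
std::string EncodeToString(const std::vector<float>& in) {
    std::string out;
    out.resize(kMaxPacketBytes);
    int written = fake_encode(in.data(), static_cast<int>(in.size()),
                              reinterpret_cast<unsigned char*>(&out[0]),
                              static_cast<int>(out.size()));
    assert(written >= 0);
    out.resize(static_cast<std::size_t>(written));  // trim to actual size
    return out;
}

// std::vector<uint8_t> as the buffer: no cast, but an unsigned element type.
std::vector<std::uint8_t> EncodeToVector(const std::vector<float>& in) {
    std::vector<std::uint8_t> out(kMaxPacketBytes);
    int written = fake_encode(in.data(), static_cast<int>(in.size()),
                              out.data(), static_cast<int>(out.size()));
    assert(written >= 0);
    out.resize(static_cast<std::size_t>(written));
    return out;
}
```

Both functions produce the same bytes; the only difference at the call site is the single `reinterpret_cast`, which is permitted because `unsigned char` may alias any object.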

Integer Types

You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.

If your code is a container that returns a size, be sure to use a type that will accommodate any possible usage of your container. When in doubt, use a larger type rather than a smaller type.

Use care when converting integer types. Integer conversions and promotions can cause undefined behavior, leading to security bugs and other problems.

On Unsigned Integers

Unsigned integers are good for representing bitfields and modular arithmetic. Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler. In other cases, the defined behavior impedes optimization.

That said, mixing signedness of integer types is responsible for an equally large class of problems. The best advice we can provide: try to use iterators and containers rather than pointers and sizes, try not to mix signedness, and try to avoid unsigned types (except for representing bitfields or modular arithmetic). Do not use an unsigned type merely to assert that a variable is non-negative.
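As a rough illustration of the kind of bug the style guide describes, the sketch below contrasts an unsigned subtraction that silently wraps with a signed one whose result can be checked. The function names are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned subtraction is defined to wrap modulo 2^32, so "used > capacity"
// produces a huge positive value instead of a negative one.
std::uint32_t RemainingUnsigned(std::uint32_t capacity, std::uint32_t used) {
    return capacity - used;
}

// With a signed type the same mistake yields a negative number, which an
// assertion or a simple comparison can catch.
std::int64_t RemainingSigned(std::int64_t capacity, std::int64_t used) {
    return capacity - used;
}

// Following the guide's advice: assert non-negativity instead of encoding
// it in the type.
bool HasRoom(std::int64_t capacity, std::int64_t used) {
    assert(capacity >= 0 && used >= 0);
    return RemainingSigned(capacity, used) > 0;
}
```

With a capacity of 10 and 12 bytes used, `RemainingUnsigned` returns 4294967294 while `RemainingSigned` returns -2, so only the signed version makes the error visible to a check.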

That makes a lot of sense to me. So it's just a historical and maintenance reason. I thought there might be a special secret behind using `std::string` :) Thank you very much! — Bumsik Kim, Nov 25 '18 at 19:24
`The Chromium code style forbids using unsigned integral types` interesting; is there a reason for this? — Qix - MONICA WAS MISTREATED, Nov 26 '18 at 13:23
@Qix In short, unsigned types are used for bit patterns. Since an unsigned type cannot overflow (it wraps around instead), it is difficult to diagnose overflow problems. Please look at my updated answer. — 273K, Nov 26 '18 at 16:22
But signed integer overflow is undefined. Is that preferable to a well defined overflow? I don't see any concrete reasons here. — Qix - MONICA WAS MISTREATED, Nov 27 '18 at 15:28
@Qix Sorry, I don't catch what you mean. Overflow of a signed integer is UB. Overflow of an unsigned integer is well defined as modulo arithmetic. On most platforms, signed overflow can be diagnosed with assertions. — 273K, Nov 27 '18 at 16:52
Right, so why is signed favorable if it's U.B.? You mean just because an implementation is free to 'canary' it using assertions instead of having to adhere to the well-defined-ness of modulo overflow as is the case with unsigned integers? — Qix - MONICA WAS MISTREATED, Nov 28 '18 at 12:42

score 1 · Answer 2 · answered Nov 25 '18 at 22:08

We can only theorize.

My speculation: they wanted to use the built-in SSO optimization that exists in std::string but might not be available for std::vector<uint8_t>.

David Haim

Short string optimization for requested buffer of 4000 bytes? – 273K Nov 26 '18 at 01:05
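The objection in this comment can be checked with a small sketch. `UsesInlineStorage` is a hypothetical helper written for this example; it relies on the assumption that an implementation with a small-string optimization keeps short strings inside the `std::string` object itself, which the standard does not require.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Returns true if the character data lives inside the std::string object,
// i.e. the string is currently using an inline (SSO) buffer.
// std::less is used because it gives a total order over unrelated pointers.
bool UsesInlineStorage(const std::string& s) {
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    std::less<const char*> before;
    return !before(s.data(), obj_begin) && before(s.data(), obj_end);
}

// Builds a zero-filled buffer of the given size, as an encoder might request.
std::string MakeBuffer(std::size_t bytes) {
    return std::string(bytes, '\0');
}
```

A 4000-byte buffer cannot fit inside a `std::string` object of a few dozen bytes, so `UsesInlineStorage(MakeBuffer(4000))` is false and the storage is allocated separately, much as it would be for a `std::vector<uint8_t>`. Whether a very short string reports true depends on the standard library in use.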
