Storing integral values in a byte sequence in C++

Question

I am implementing an LZW compression/decompression utility library and am in need of returning the compressed output in what I am using as:

using ByteSequence = std::vector<std::uint8_t>

The output format for the compressor will include the positions in the compressor's dictionary of various sequences found by the algorithm. For example, having 16-bit positions in the output would look like:

std::vector<std::uint16_t> pos{123, 385, /* ... */};

The output, however needs to be a ByteSequence, and it needs to be portable among architectures. What I am currently doing to convert the pos vector to the desired format is:

for (auto p : pos)
{
  std::uint8_t *bytes = (std::uint8_t *) &p;
  output.push_back(bytes[0]);
  output.push_back(bytes[1]);
}

This works, but only under the assumption that the keys will be 16-bit each and to be honest, it looks like a cheap trick to me.

How should I do this in a better, cleaner way? Thank you!

*"needs to be portable among architectures "* - not by doing what that code does, it isn't. At *best* that will spew different byte streams depending on endianness present on the encoder. Pull each octet out of the `uint16_t` using shifts and masks; first the high, then the low, and make sure your decoder knows that decision. — WhozCraig, Nov 24 '19 at 11:27
@WhozCraig I understand your point, that's what I thought of too. However, for for keys larger than 2 bytes, shoud I still extract the bytes in the same way? — Victor, Nov 24 '19 at 11:31
The same way as what you show here? Or byte by byte top-down as I described? No to the former, yes to the latter ? — WhozCraig, Nov 24 '19 at 11:38

score 1 · Answer 1 · answered Nov 24 '19 at 11:40

This should be portable, though possibly not so efficient as direct byte manipulation:

template<class T>
void number2bytes(std::vector<uint8_t>& bytes, T x)
{
    static_assert(std::is_integral<T>::value, "Integral required.");
    for (size_t i = 0; i < sizeof(T); ++i)
    {
        bytes.push_back(x & 0xFF);
        x >>= 8;
    }
}

The static_assert is added to protect from accidental passing some weird non-number type overloading & and >>=.

Evg · Accepted Answer · 2019-11-24T12:40:42.037

The way you extract bytes is undefined behaviour. The C++ standard [basic.lval] reads:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

. . .

a char, unsigned char, or std::byte type.

std::uint8_t is not in this list, and AFAIK there is no guarantee that std::uint8_t and unsigned char are the same type.

A conversion function might look like:

template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
    out.reserve(out.size() + in.size() * sizeof(T));
    for (const T& i : in) {
        std::uint8_t buff[sizeof(T)];
        std::memcpy(buff, &i, sizeof(T));
        std::copy(std::begin(buff), std::end(buff), std::back_inserter(out));
    }
}

Alternative implementation without back_inserter:

template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
    const auto old_size = out.size();
    out.resize(old_size + in.size() * sizeof(T));
    auto dest = out.data() + old_size;
    for (const T& i : in) {
        std::memcpy(dest, &i, sizeof(T));
        dest += sizeof(T);
    }
}

Beware about endianness. It should be taken into account either in the forward conversion or in the backward one.

Thank you for your answer! Could you please tell me why this is undefined behavior? I saw [this](https://stackoverflow.com/a/4181991/1673776) answer pointing this exact code fragment to detect endianness. Also, could you please elaborate a little on *endianness should be taken into account*? I am aware of what will happen when the data will be read on a system with different endianness that the one where it was encoded, but I am not sure what's the best way to compensate. — Victor, Nov 24 '19 at 12:07
I see what you're saying. I will adjust my implementation to use `std::byte` wherever appropriate. Thank you very much! — Victor, Nov 24 '19 at 12:32
@Victor, to compensate for endianness you typically manually reverse bytes before interpreting them as a value (if endiannesses are different). — Evg, Nov 24 '19 at 12:39

Storing integral values in a byte sequence in C++

2 Answers2