2

I am implementing an LZW compression/decompression utility library and am in need of returning the compressed output in what I am using as:

using ByteSequence = std::vector<std::uint8_t>

The output format for the compressor will include the positions in the compressor's dictionary of various sequences found by the algorithm. For example, having 16-bit positions in the output would look like:

std::vector<std::uint16_t> pos{123, 385, /* ... */};

The output, however needs to be a ByteSequence, and it needs to be portable among architectures. What I am currently doing to convert the pos vector to the desired format is:

for (auto p : pos)
{
  std::uint8_t *bytes = (std::uint8_t *) &p;
  output.push_back(bytes[0]);
  output.push_back(bytes[1]);
}

This works, but only under the assumption that the keys will be 16-bit each and to be honest, it looks like a cheap trick to me.

How should I do this in a better, cleaner way? Thank you!

Victor
  • 13,914
  • 19
  • 78
  • 147
  • *"needs to be portable among architectures "* - not by doing what that code does, it isn't. At *best* that will spew different byte streams depending on endianness present on the encoder. Pull each octet out of the `uint16_t` using shifts and masks; first the high, then the low, and make sure your decoder knows that decision. – WhozCraig Nov 24 '19 at 11:27
  • @WhozCraig I understand your point, that's what I thought of too. However, for for keys larger than 2 bytes, shoud I still extract the bytes in the same way? – Victor Nov 24 '19 at 11:31
  • The same way as what you show here? Or byte by byte top-down as I described? No to the former, yes to the latter ? – WhozCraig Nov 24 '19 at 11:38

2 Answers2

1

This should be portable, though possibly not so efficient as direct byte manipulation:

template<class T>
void number2bytes(std::vector<uint8_t>& bytes, T x)
{
    static_assert(std::is_integral<T>::value, "Integral required.");
    for (size_t i = 0; i < sizeof(T); ++i)
    {
        bytes.push_back(x & 0xFF);
        x >>= 8;
    }
}

The static_assert is added to protect from accidental passing some weird non-number type overloading & and >>=.

aparpara
  • 2,171
  • 8
  • 23
1

The way you extract bytes is undefined behaviour. The C++ standard [basic.lval] reads:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

. . .

  • a char, unsigned char, or std::byte type.

std::uint8_t is not in this list, and AFAIK there is no guarantee that std::uint8_t and unsigned char are the same type.


A conversion function might look like:

template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
    out.reserve(out.size() + in.size() * sizeof(T));
    for (const T& i : in) {
        std::uint8_t buff[sizeof(T)];
        std::memcpy(buff, &i, sizeof(T));
        std::copy(std::begin(buff), std::end(buff), std::back_inserter(out));
    }
}

Alternative implementation without back_inserter:

template<typename T>
void convert_forward(const std::vector<T>& in, std::vector<std::uint8_t>& out) {
    const auto old_size = out.size();
    out.resize(old_size + in.size() * sizeof(T));
    auto dest = out.data() + old_size;
    for (const T& i : in) {
        std::memcpy(dest, &i, sizeof(T));
        dest += sizeof(T);
    }
}

Beware about endianness. It should be taken into account either in the forward conversion or in the backward one.

Evg
  • 25,259
  • 5
  • 41
  • 83
  • Thank you for your answer! Could you please tell me why this is undefined behavior? I saw [this](https://stackoverflow.com/a/4181991/1673776) answer pointing this exact code fragment to detect endianness. Also, could you please elaborate a little on *endianness should be taken into account*? I am aware of what will happen when the data will be read on a system with different endianness that the one where it was encoded, but I am not sure what's the best way to compensate. – Victor Nov 24 '19 at 12:07
  • I see what you're saying. I will adjust my implementation to use `std::byte` wherever appropriate. Thank you very much! – Victor Nov 24 '19 at 12:32
  • 1
    @Victor, to compensate for endianness you typically manually reverse bytes before interpreting them as a value (if endiannesses are different). – Evg Nov 24 '19 at 12:39