Optimized struct serialization with template meta programming

Question

I want to serialize a struct on a 32-bit aligned platform (armv7). The idea is to serialize the struct members (which I extract with some meta programming) into a std::byte array and then copy that into a std::uint32_t array (the output buffer).

The two serializers look like this:

// serialize to std::byte array
template <typename T, class OutputIterator, 
  typename std::enable_if_t<
    std::is_same<typename std::iterator_traits<OutputIterator>::value_type, 
      std::byte>::value, int> = 0>
std::size_t serialize(const T& value, OutputIterator iterator)
{
  std::size_t offset = 0; 
  visit_struct::for_each(value,
  [&](const char*, const auto& element) 
  {
    auto raw = reinterpret_cast<std::byte const*>(&element);
    auto type_size = sizeof(decltype(element));
    std::copy(raw, std::next(raw, type_size), std::next(iterator, offset));
    offset += type_size;
  });
  return offset;
}

// serialize to std::uint32_t array
template <typename T, class OutputIterator, 
  typename std::enable_if_t<
    std::is_same<typename std::iterator_traits<OutputIterator>::value_type, 
      std::uint32_t>::value, int> = 0>
std::size_t serialize(const T& value, OutputIterator iterator)
{
  constexpr std::size_t type_size = ext::mock_maker<T>::size;
  constexpr std::size_t aligned_type_size = (type_size + 4 - 1) / 4;
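  // Padded to whole words, word-aligned, and zeroed so the uint32_t reads below stay in bounds.
  alignas(std::uint32_t) std::array<std::byte, aligned_type_size * 4> raw{};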
  serialize(value, raw.begin()); 
  auto raw_aligned = reinterpret_cast<std::uint32_t const*>(raw.data());
  std::copy(raw_aligned, std::next(raw_aligned, aligned_type_size), iterator);
  return aligned_type_size;
}
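
The word count in the second serializer rounds up by adding before dividing, which wraps around if the size is within 3 of the largest std::size_t. That cannot happen for a size that comes from sizeof, so this is a minimal sketch of the general pattern rather than a required fix. The helper name `ceil_div` is hypothetical and not part of the code above.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: rounds x / y up without computing x + y - 1,
// so it cannot wrap around for large x. Requires y != 0.
constexpr std::size_t ceil_div(std::size_t x, std::size_t y)
{
  return x / y + (x % y != 0 ? 1 : 0);
}

// Compile-time checks, including the largest representable input.
static_assert(ceil_div(0, 4) == 0);
static_assert(ceil_div(5, 4) == 2);
static_assert(ceil_div(8, 4) == 2);
static_assert(ceil_div(std::numeric_limits<std::size_t>::max(), 4)
              == std::numeric_limits<std::size_t>::max() / 4 + 1);
```

With such a helper, the word count could be written as `ceil_div(type_size, 4)`.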

I hoped the compiler could somehow optimize away the intermediate std::byte array, but my test implementation suggests otherwise. Is there a way to achieve this elegantly?

On a side note, `(type_size + 3) / 4` isn't an overflow-safe method for ceil-dividing. See [How can you divide integers with floor, ceil and outwards rounding modes in C++?](https://stackoverflow.com/q/63436490/5740428) — Jan Schultke, Aug 18 '20 at 11:24
I thought my answer optimized away the extra copy, but I was wrong. — Filipp, Sep 12 '20 at 21:44
Interestingly, on x86 this optimizes wonderfully to just a bunch of `mov`s. On arm no matter what I do GCC still emits actual calls to `memcpy`. Is being generic important? If yes, consider writing an output iterator adapter that takes `std::byte` and packs them into `uint32_t`. If not, consider using `std::basic_streambuf` in your interface. — Filipp, Sep 12 '20 at 21:54
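
The output iterator adapter suggested in the last comment could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested solution for armv7. The names `word_packer` and `pack_members` are hypothetical. `pack_members` stands in for the `visit_struct::for_each` loop so the example stays self-contained. Bytes are packed in little-endian order, and C++17 is assumed. Whether GCC on arm then avoids the `memcpy` calls has not been verified here. The byte serializer in the question calls `std::next` on its output iterator, which a pure output iterator does not support, so it would have to thread the iterator returned by `std::copy` instead, as done below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical adapter: accepts std::byte values and packs every four of
// them into one std::uint32_t written through the wrapped iterator.
// The first byte received lands in the least significant byte of the word.
template <class WordIt>
class word_packer
{
public:
  using iterator_category = std::output_iterator_tag;
  using value_type = std::byte;  // lets an enable_if on value_type select it
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  explicit word_packer(WordIt out) : out_(out) {}

  // Assigning a byte accumulates it; a full word is written out immediately.
  word_packer& operator=(std::byte b)
  {
    assert(count_ < 4);
    word_ |= std::to_integer<std::uint32_t>(b) << (8 * count_);
    if (++count_ == 4) emit();
    return *this;
  }

  // Dereference and increment are no-ops that return the packer itself,
  // so the accumulated state is never lost to a temporary copy.
  word_packer& operator*() { return *this; }
  word_packer& operator++() { return *this; }
  word_packer& operator++(int) { return *this; }

  // Writes a partially filled word (zero-padded), returns total words written.
  std::size_t finish()
  {
    if (count_ != 0) emit();
    return words_;
  }

private:
  void emit()
  {
    *out_ = word_;
    ++out_;
    ++words_;
    word_ = 0;
    count_ = 0;
  }

  WordIt out_;
  std::uint32_t word_ = 0;
  unsigned count_ = 0;
  std::size_t words_ = 0;
};

// Stand-in for the visit_struct loop: copies each argument's object
// representation through the packer, carrying the iterator state along.
template <class WordIt, class... Members>
std::size_t pack_members(WordIt out, const Members&... members)
{
  word_packer<WordIt> packer(out);
  auto put = [&](const auto& m) {
    auto raw = reinterpret_cast<const std::byte*>(&m);
    packer = std::copy(raw, raw + sizeof(m), packer);
  };
  (put(members), ...);
  return packer.finish();
}
```

For example, passing five `std::uint8_t` values to `pack_members` together with an iterator into a `std::uint32_t` buffer writes two words: the first holds the first four bytes and the second holds the fifth byte padded with zeros. No intermediate std::byte array appears in the source, although the compiler still decides how the member copies are lowered.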

0 Answers