Understanding serialization of polymorphic objects in C++

Question

EDIT: I realised that the code below is a good example of what you cannot do in C++ with anything that is not a POD.

There doesn't seem to be a way to avoid putting a type ID into the classes and doing some sort of switch or table lookup (both of which must be carefully maintained) on the receiver side to rebuild the objects.

I have written some toy code to serialise objects, plus two separate mains to write them to and read them from a file.

common.h:

#include <iostream>
using namespace std;

template <typename T>
size_t serialize(std::ostream & o, const T & t) {
  const char * bytes = reinterpret_cast<const char*>(&t);
  for (size_t i = 0; i < t.size(); ++i) {
    o << bytes[i];
  }
  return t.size();
}

size_t deserialize(std::istream & i, char * buffer) {
  size_t len = 0;
  char c;
  while (i.get(c)) {
    buffer[len] = c;
    ++len;
  }
  return len;
}

// toy classes
struct A {
  int a[4];
  virtual ~A() {}
  virtual void print(){cout << "A\n";}
  virtual size_t size() const {return sizeof(*this);}
};
struct B: A {
  int b[16];
  virtual ~B() {}
  virtual void print(){cout << "B\n";}
  virtual size_t size() const {return sizeof(*this);}
};

out.cpp:

#include <fstream>
#include "common.h"

int main() {
  B b;
  A& a = *static_cast<A*>(&b);
  ofstream ofile("serial.bin");
  cout << "size = " << serialize(ofile, a) << endl;
  ofile.close();
  return 0;
}

in.cpp:

#include <fstream>
#include "common.h"

int main() {
  char buffer[1024];
  ifstream ifile("serial.bin");
  cout << "size = " << deserialize(ifile, buffer) << endl;
  ifile.close();
  A& a = *reinterpret_cast<A*>(buffer);
  a.print();
  return 0;
}

If my classes have no virtual functions, this appears to work fine, but in.cpp crashes when they do.

My understanding is that the vptr written by out.cpp is not valid when used by in.cpp.

Is there something that can be done, ideally without manually creating and maintaining a vtable?

I'd recommend looking for an appropriate serialization library (e.g. Boost archive) to do that for you. You basically have undefined behavior; the raw casting would only work with POD types. — user0042, Aug 06 '17 at 09:09
@user0042 the code should go to a microcontroller, so no chance of bringing in Boost :( plus I would like to understand. — DarioP, Aug 06 '17 at 09:10
In that case I would recommend using Google Protobuf along with the Nano-PB C interface. You have to provide the mapping code with classes yourself. And no, there's no way to manage a vtable manually somehow. — user0042, Aug 06 '17 at 09:14
The problem is not just with the vtable pointer, but pointers in general. If you save a pointer (or reference) to disk and load it in another program - what are the odds that they are now valid? As close to zero as you can get! — Bo Persson, Aug 06 '17 at 09:21
You cannot create an object by reinterpreting some bytes. (If you could, we could replace copy constructors by calls to memcpy). Use a constructor to create an object. — n. m. could be an AI, Aug 06 '17 at 09:26
@n.m. It looks like you can https://stackoverflow.com/questions/3021333/can-i-use-memcpy-in-c-to-copy-classes-that-have-no-pointers-or-virtual-functio but virtual functions might be the problem... — DarioP, Aug 06 '17 at 09:31

EmDroid · Accepted Answer · 2017-08-06T10:00:18.190

If you absolutely cannot use any library (as there still might be some options, even for embedded platforms), one option of serializing polymorphic classes might be to provide virtual serialize/deserialize methods.

In this case for example:

struct A {
  int a[4];
  virtual ~A() {}
  virtual void print(){cout << "A\n";}
  virtual size_t size() const {return sizeof(*this);}
  virtual void serialize(std::ostream & o) const
  {
      // write a separator after each value so that text extraction can split them again
      for (int i = 0; i < 4; ++i) o << a[i] << ' ';
  }
  virtual void deserialize(std::istream & in) // not named i, which would be hidden by the loop index
  {
      for (int i = 0; i < 4; ++i) in >> a[i];
  }
};
struct B: A {
  int b[16];
  virtual ~B() {}
  virtual void print(){cout << "B\n";}
  virtual size_t size() const {return sizeof(*this);}
  virtual void serialize(std::ostream & o) const
  {
      A::serialize(o); // base members first, then our own
      for (int i = 0; i < 16; ++i) o << b[i] << ' ';
  }
  virtual void deserialize(std::istream & in)
  {
      A::deserialize(in); // read back in the same order as written
      for (int i = 0; i < 16; ++i) in >> b[i];
  }
};

// prg 1
B b;
b.serialize(ofile);

// prg 2
B b;
b.deserialize(ifile);

Basically, you'll write the particular members to the file one by one.

However, this only covers the simple case where you know which class to expect in the file. If there can be multiple classes, you also need to write some class identification (e.g. a serialization ID) so the reader knows which class to read. And if the classes might change over time, you may need some kind of versioning.

Pointers are also tricky, as mentioned, especially because they can be NULL: you could first write a bool (one byte) saying whether the pointer is NULL, then the contents, if any. You can serialize/deserialize e.g. std::string or std::vector in a similar way: first write the length, then the items. When reading, read the length, reserve or resize the string/vector, and then read the items.

Another issue arises if the file is transferred to a different machine, which might have a different byte order (endianness). So as you can see, if a suitable library is available, it is better to use it than to write everything from scratch.

To add polymorphic deserialization (I can see you are using just A on the reader side), you can have for example:

struct A {
  ...
  virtual int get_serialization_id() const = 0;
};
struct B: A {
  ...
  static const int SERIALIZATION_ID = 1; // needs to be different in every polymorphic class
  virtual int get_serialization_id() const
  { return SERIALIZATION_ID; }
};

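void serialize(std::ostream & o, const A & a)
{
  o << a.get_serialization_id() << ' '; // the ID goes first so the reader knows what to construct
  a.serialize(o);                       // serialize() writes into the stream and returns nothing
}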

std::unique_ptr<A> deserialize(std::istream & i)
{
  std::unique_ptr<A> result;
  int id;
  i >> id;
  switch (id)
  {
  case B::SERIALIZATION_ID:
    result = std::make_unique<B>();
    break;
  case C::SERIALIZATION_ID:
    result = std::make_unique<C>();
    break;
  ...
  default:
    // leave NULL or throw exception
    return result;
  }
  result->deserialize(i);
  return result;
}

To avoid the switch, you could go more fancy and provide some kind of factory registration (registering serialization IDs along with the class factories in a map, then use the registry to find the factory and create the class). You can go pretty fancy with deserialization :).

And note that there are cases which are really difficult to solve (e.g. recreating instance structures with shared pointers pointing to the same instance from multiple other instances, etc.).

But now the receiver has to know the exact type which misses the point of sending a polymorphic object, am I wrong? — DarioP, Aug 06 '17 at 09:44
Yes, in the real case scenario you'd need some identification of the class (e.g. serialization id), as I also mentioned. — EmDroid, Aug 06 '17 at 09:46
Technically you can have some sort of [serialization id]/[class factory] registry to avoid the switch, also added a note to the answer. — EmDroid, Aug 06 '17 at 10:05
