I have a large dataset (100k+ items) that I serialize using Boost.Serialization, and this works satisfactorily.
Now, with even larger datasets, the entire set no longer fits into memory (I currently store a std::map with all the data in the archive). Since I need neither random reads nor writes and only access one item at a time, I thought about streaming the dataset by saving instances directly to the archive (archive << item1 << item2 ...) and unpacking them one by one.
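
A rough sketch of what I have in mind for this first option (Item is just a placeholder type here, and writing the item count up front is one way to know when to stop reading):

```cpp
#include <cstddef>
#include <fstream>
#include <vector>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>

// Placeholder type; the real dataset entries would go here.
struct Item {
    int id;
    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) { ar & id; }
};

// Write the item count first, then stream all items into a single archive.
void write_stream(const char* path, const std::vector<Item>& items) {
    std::ofstream ofs(path, std::ios::binary);
    boost::archive::binary_oarchive oa(ofs);
    const std::size_t count = items.size();
    oa << count;
    for (const Item& item : items)
        oa << item;
}

// Read the items back one by one; only one item is in memory at a time.
void read_stream(const char* path) {
    std::ifstream ifs(path, std::ios::binary);
    boost::archive::binary_iarchive ia(ifs);
    std::size_t count = 0;
    ia >> count;
    for (std::size_t i = 0; i < count; ++i) {
        Item item;
        ia >> item;
        // process(item);
    }
}
```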
The other option would be to develop a new file format from scratch (something simple like <length><block>, where each <block> corresponds to one Boost.Serialization archive), because I noticed that it doesn't seem possible to detect the end of an archive in Boost.Serialization without catching exceptions (input_stream_error should be thrown on a read past the end of the archive, I think).
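
A rough sketch of the second option, with one independent archive per item behind a 64-bit length prefix (again using the placeholder Item type from above; the prefix width is just an assumption):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>

// Append one <length><block> record; the block is an independent archive.
void append_block(std::ofstream& out, const Item& item) {
    std::ostringstream buf;
    {
        boost::archive::binary_oarchive oa(buf);
        oa << item;                   // serialize into an in-memory block
    }                                 // archive destructor finishes the block
    const std::string bytes = buf.str();
    const std::uint64_t len = bytes.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));          // <length>
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));  // <block>
}

// Read the next record; returns false on a clean end of file.
bool read_block(std::ifstream& in, Item& item) {
    std::uint64_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof(len)))
        return false;
    std::string bytes(static_cast<std::size_t>(len), '\0');
    in.read(&bytes[0], static_cast<std::streamsize>(len));
    std::istringstream buf(bytes);
    boost::archive::binary_iarchive ia(buf);
    ia >> item;
    return true;
}
```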
Which option is preferable? Abusing Serialization archives for streaming seems odd and hacky, but it has the big advantage of not reinventing the wheel, while a file format that wraps archives feels cleaner but more error-prone.