
I have a large dataset (100k+ items) that I want to serialize using Boost.Serialization. This works satisfactorily.

Now, when working with even larger datasets, the entire set doesn't fit into memory anymore (I currently store a std::map with all the data in the archive). Since I need neither random reads nor random writes and only need to access one item at a time, I thought about streaming the dataset by directly saving instances to the archive (archive << item1 << item2 ...) and unpacking them one by one.

The other option would be to develop a new file format from scratch (something simple like <length><block> where each <block> corresponds to one Boost.Serialization archive), because I noticed that it doesn't seem possible to detect the end of an archive in Boost.Serialization without catching exceptions (input_stream_error should be thrown on a read past the end of the archive, I think).
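
For concreteness, here is a rough sketch of the wrapping format I have in mind (illustration only; Item stands in for my actual value type and the helper names are made up):

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/string.hpp>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

struct Item {                        // placeholder for the actual value type
    int id;
    std::string payload;

    template <class Archive>
    void serialize(Archive& ar, unsigned /*version*/) {
        ar & id & payload;
    }
};

// <length><block>: each block is a self-contained archive holding one item
void write_block(std::ostream& out, const Item& item) {
    std::ostringstream block;
    {
        boost::archive::binary_oarchive oa(block);
        oa << item;                  // archive is completed when oa goes out of scope
    }
    const std::string bytes = block.str();
    const std::uint64_t length = bytes.size();
    out.write(reinterpret_cast<const char*>(&length), sizeof length);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}

// returns false on a clean end of file, i.e. when no more blocks follow
bool read_block(std::istream& in, Item& item) {
    std::uint64_t length = 0;
    if (!in.read(reinterpret_cast<char*>(&length), sizeof length))
        return false;
    std::string bytes(length, '\0');
    in.read(&bytes[0], static_cast<std::streamsize>(length));
    std::istringstream block(bytes);
    boost::archive::binary_iarchive ia(block);
    ia >> item;
    return true;
}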

Which option is preferable? Abusing serialization archives for streaming seems odd and hacky, but it has the big advantage of not reinventing the wheel, while a file format wrapping archives feels cleaner but more error-prone.

dom0

1 Answer

Using Boost Serialization for streaming is not abusing it, nor is it odd.

In fact, Boost Serialization has nothing but the streaming archive interface. So yes, the applicable approach would be to do as you said:

archive << number_of_items;
for(auto it = input_iterator(); it != end(); ++it)
    archive << *it;
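
Fleshed out a little, a minimal end-to-end sketch of that approach could look like this (Item, the file name and the fixed count are placeholders; binary archives are assumed):

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <cstddef>
#include <fstream>
#include <iostream>

struct Item {                              // placeholder item type; anything serializable works
    double value;
    template <class Archive>
    void serialize(Archive& ar, unsigned /*version*/) { ar & value; }
};

int main() {
    {
        std::ofstream ofs("items.bin", std::ios::binary);
        boost::archive::binary_oarchive archive(ofs);

        const std::size_t count = 3;       // known (or precomputed) up front
        archive << count;
        for (std::size_t i = 0; i != count; ++i) {
            const Item item{i * 1.5};      // produce one item at a time...
            archive << item;               // ...and stream it straight into the archive
        }
    }
    {
        std::ifstream ifs("items.bin", std::ios::binary);
        boost::archive::binary_iarchive archive(ifs);

        std::size_t count = 0;
        archive >> count;
        for (std::size_t i = 0; i != count; ++i) {
            Item item;
            archive >> item;               // unpack one item at a time
            std::cout << item.value << "\n";
        }
    }
}

At no point does the whole dataset have to live in memory; only the single Item being (de)serialized does.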

Indeed, very little stops you from doing the same in your serialize method. You could possibly even make it "automatic" by wrapping your stream into something (like an iterator_range?) and extending Boost Serialization to 'understand' these, like it 'understands' containers, arrays, etc.

The file format approach is definitely not cleaner (from the library's perspective), since it ruins the archive format isolation. The serialization library has been carefully designed to avoid knowledge about the archive representation, and it would be a breach of abstraction to circumvent this.

sehe
  • Now just a small issue remains: when creating the archive I don't know how many items there are to come, so I cannot serialize the number of items first. So far the best idea I've come up with is to have the main data file and an accompanying file carrying metadata like the number of items. The second-best idea would be to mess around with the format, like appending the number of items as a simple 4-byte int and reading that manually before reading the archive. – dom0 May 12 '14 at 11:49
  • @dom0 No need for either. You can just use a [Sentinel Value](http://en.wikipedia.org/wiki/Sentinel_value) to indicate "EOS" (end-of-stream); see the sketch below these comments. – sehe May 12 '14 at 11:54
  • Well, I really missed the forest for the trees here. Thanks! :) (I guess I was stuck thinking the archive should be used kind of like a container or something similar, where one only stores one kind of object.) – dom0 May 12 '14 at 13:22
  • Not exactly what was asked in the question, but related: https://github.com/boostorg/serialization/issues/273. I think a fundamental difference from a stream iterator is that, in the philosophy of the archive, there shouldn't be an (exposed) sentinel, even if there is one in a hypothetical underlying stream. The idea of the archives, IMO, is that it is always the consumer's responsibility to know how many elements to extract from the archive. In other words, I am making archive_iterators compatible with `std::copy_n` but not necessarily with `std::copy` when used as a source. – alfC Nov 28 '22 at 23:12
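
For illustration, here is a minimal sketch of the sentinel-value idea from the comments: a "more items follow" flag is written before every item, so the reader never needs to know the item count up front (Item and the file name are placeholders):

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <fstream>
#include <iostream>

struct Item {                              // placeholder item type
    int id;
    template <class Archive>
    void serialize(Archive& ar, unsigned /*version*/) { ar & id; }
};

int main() {
    {
        std::ofstream ofs("stream.bin", std::ios::binary);
        boost::archive::binary_oarchive oa(ofs);

        const bool more = true, done = false;
        for (int id = 0; id != 5; ++id) {  // total count not known in advance
            oa << more;                    // sentinel: another item follows
            const Item item{id};
            oa << item;
        }
        oa << done;                        // sentinel: end of stream
    }
    {
        std::ifstream ifs("stream.bin", std::ios::binary);
        boost::archive::binary_iarchive ia(ifs);

        bool more = false;
        ia >> more;
        while (more) {
            Item item;
            ia >> item;
            std::cout << item.id << "\n";
            ia >> more;
        }
    }
}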