Simple question, for which so far I haven't easily found an answer. (The similar questions suggested as I type this all point to nothing relevant - but I can't believe I'm the only person facing this challenge.)

Say I have an object in memory which contains

  1. simple types (e.g. name, computer name, creation date, configuration, etc.); and
  2. a collection of some kind (e.g. time series of statistical measures, e.g. moving average)

To serialize these it makes sense

  1. to store the simple types in a fully-featured serialization format e.g. JSON, XML, YAML
  2. to store the collection values in a CSV file (to avoid needless repetition of the tags for each entry)

But this means I end up with two files. It would be better to have all the info in one file, so that the reader can see unambiguously that (2) belongs with (1). A single file is also easier to maintain in the file system.

I don't want to combine into a BLOB as this would lose human readability.

Is there a simple technique for combining the JSON in (1) and CSV for (2) into one file?

My first guess would be to have (say) XML tags to separate the different types. e.g.

<SimpleTypes format="JSON">
   [JSON for simple types]
</SimpleTypes>
<Collection format="CSV" type="Dictionary" name="DailySalesTotal">
   [CSV for collection]
</Collection>
<Collection format="CSV" type="Dictionary" name="DailyFootfallInStore">
   [CSV for collection]
</Collection>

Then just open up the file, parse the XML into separate JSON and CSV sections and handle as normal.
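In code, the reading side of this might look roughly like the following (a Python sketch with made-up field names; I've wrapped everything in a single root element so the wrapper stays well-formed XML):

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical hybrid file content. Caveat: the embedded JSON/CSV must not
# contain characters that are special to XML (<, &) without escaping.
hybrid = """<Root>
<SimpleTypes format="JSON">
{"name": "Store42", "computerName": "SRV01"}
</SimpleTypes>
<Collection format="CSV" type="Dictionary" name="DailySalesTotal">
2018-06-11,1023.50
2018-06-12,987.25
</Collection>
</Root>"""

root = ET.fromstring(hybrid)

# Hand the text of each section to the appropriate parser.
simple = json.loads(root.find("SimpleTypes").text)
collections = {}
for coll in root.findall("Collection"):
    rows = list(csv.reader(io.StringIO(coll.text.strip())))
    collections[coll.get("name")] = rows

print(simple["name"])                     # Store42
print(collections["DailySalesTotal"][0])  # ['2018-06-11', '1023.50']
```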

Is this a sensible approach? Any risks?

Or is there a library anywhere for this? I'm using C# so would need a .NET library.

Bit Racketeer
  • Good question. I believe text files should be both understandable by the program and the person who is debugging the program. – jdweng Jun 13 '18 at 09:35
  • Yes, absolutely, otherwise debugging becomes unnecessarily difficult. You can always switch to a BSON or BLOB format later, once a human-readable approach is working. But for dev and dev-testing, human-readability is essential. – Bit Racketeer Jun 13 '18 at 10:29
  • Look at the following posting from earlier this week which is related to your question : https://stackoverflow.com/questions/50792632/multiple-names-for-xml-elements-and-attributes/50795689#comment88620000_50795689 – jdweng Jun 13 '18 at 10:51
  • I've heard rumours that it is possible to serialize a collection to CSV *within* a JSON field - but have not yet found confirmation. If this is true, it solves the problem of bloat. Can anyone confirm/deny this? – Bit Racketeer Jun 13 '18 at 12:24
  • It looks horrible to me. Three parsers instead of one, three lots of different escaping conventions layered on top of each other: scope for a total mess. Either XML or JSON on its own will do the job much better than this hybrid. – Michael Kay Jun 13 '18 at 13:45
  • I'm voting to close this question as off-topic because it is asking for design advice rather than programming help. – Michael Kay Jun 13 '18 at 13:47
  • I'd like both programming advice and design advice. If I can't ask for design advice here, where should I ask for it? Secondly, I've tried XML or JSON on its own as you suggest - but the result is files that are nearly 100MB in size because all the element names are repeated. This is why I am thinking of CSV. – Bit Racketeer Jun 13 '18 at 18:16
  • CSV equivalent files are 300kB (0.3% of the size of JSON or XML) but obviously don't have the simple types. So I am stuck as to what to do. – Bit Racketeer Jun 13 '18 at 18:46

3 Answers

I challenge the given reasons why this would make sense.

Mainly, the proposed XML wrapper is just another serialization format. Let's see if we can achieve the stated goal:

to save needless repetition of the tags for each entry

with YAML. Borrowing from your example, assume we have name and computer_name as "simple" data, and a list of times with some data attached as "collection" data. The trivial approach would look something like this:

name: My Name
computer_name: My Computer
collection:
- time: 1:30
  data: foo
- time: 2:20
  data: bar

There are no repeating tags involved. When deserializing into the proper type, YAML will know that the value of collection: will be a list of data points without explicit tags. However, we have an overhead because we specify the field names time and data every time. So let's try and get rid of them:

name: My Name
computer_name: My Computer
collection:
- [ 1:30, foo ]
- [ 2:20, bar ]

Most YAML frameworks provide the features needed to deserialize these YAML sequences into appropriate data classes, and we are still within YAML syntax. Now, let's see if we can get actual CSV in there:

name: My Name
computer_name: My Computer
collection: |
  1:30;foo
  2:20;bar

Using a YAML literal block scalar, we now supply the collection data as a single scalar, which we can then run through a CSV parser. We can even instruct our YAML implementation to do this immediately when it encounters the value of collection:.
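For illustration, once a YAML library (e.g. PyYAML - not shown here, this is only a sketch) has handed you the block scalar as a plain string, the CSV step is trivial:

```python
import csv
import io

# The value of the `collection` key as a YAML parser would deliver it:
# a plain multi-line string produced by the literal block scalar.
scalar = "1:30;foo\n2:20;bar\n"

# Parse each line with a semicolon-delimited CSV reader.
rows = [tuple(r) for r in csv.reader(io.StringIO(scalar), delimiter=";")]
print(rows)  # [('1:30', 'foo'), ('2:20', 'bar')]
```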

It would be more difficult to do this with JSON as the master serialization format, since JSON has no block scalars. XML would also work, but is very verbose by itself.

While we're at YAML, there is another possible solution: use a document end marker to signal to the YAML parser that the YAML document ends here, and put the CSV data after it. Jekyll does something similar to separate the "YAML front matter" from the content. It would look like this:

name: My Name
computer_name: My Computer
...
1:30;foo
2:20;bar

... is the document end marker. Jekyll uses --- instead, which according to the spec would start a second YAML document there, and I don't know why they chose to do that. ... is the more spec-compliant way.
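A minimal sketch of the reading side of this approach, using only Python's standard library (the YAML head is kept as a string here; a real implementation would hand it to a YAML parser):

```python
import csv
import io

document = """name: My Name
computer_name: My Computer
...
1:30;foo
2:20;bar
"""

# Split at the YAML document end marker. Caveat: this assumes the CSV
# payload never contains a line consisting of "..." by itself.
head, _, tail = document.partition("\n...\n")

# head goes to a YAML parser (not shown); tail goes to a CSV parser.
rows = list(csv.reader(io.StringIO(tail), delimiter=";"))
print(rows)  # [['1:30', 'foo'], ['2:20', 'bar']]
```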

flyx

See this.

Create some model using XmlAttribute:

using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Serialization;

public class Foo
{
    [XmlAttribute]
    public string Bar { get; set; }
    [XmlAttribute]
    public List<int> List1 { get; set; }
    [XmlAttribute]
    public List<double> List2 { get; set; }
}

Serialize it:

var foo = new Foo
{
    Bar = "test",
    List1 = new List<int> { 1, 2, 3 },
    List2 = new List<double> { 0.1, 0.2, 0.3 }
};

var xs = new XmlSerializer(typeof(Foo));
var settings = new XmlWriterSettings { NewLineOnAttributes = true, Indent = true };
using (var xmlWriter = XmlWriter.Create(Console.Out, settings))
{
    xs.Serialize(xmlWriter, foo);
}

Console.WriteLine();

The result is compact and quite readable:

<Foo xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  Bar="test"
  List1="1 2 3"
  List2="0.1 0.2 0.3" />

Don't reinvent the wheel.

Alexander Petrov
  • I tried something like this, but it creates files that are difficult to check by eye because you lose the alignment between the lists. Even if they're, say, 30 items long it's difficult to follow them both. – Bit Racketeer Jun 13 '18 at 18:19

I have found two compromise solutions that work well.

  1. Save a file for each serialization format with the same filename and a different extension, e.g. <GUID>.csv, <GUID>.xml, <GUID>.yaml, <GUID>.json
  2. Use the YAML approach outlined by flyx above

I have therefore accepted flyx's answer. Many thanks!

Bit Racketeer