Understanding the concept of 'serializing by reference'

Question

I'm writing my own binary serializer optimized for game development. So far it's fully functional. It emits IL to generate the [de]serialization methods given a sequence of types in advance. The only missing feature is serializing things by reference; everything is currently serialized by value.

In order to implement it, I have to understand it first. This is what I'm finding to be a bit tricky. Let me show you what I understood in these couple of examples:

Example 1 (as seen here):

public class Person
{
    public string Name;
    public Person Friend;
}

static void Main(string[] args)
{
    Person p1 = new Person();
    p1.Name = "John";

    Person p2 = new Person();
    p2.Name = "Mike";

    p1.Friend = p2;

    Person[] group = new Person[] { p1, p2 };

    var serializer = new DataContractSerializer(group.GetType(), null, 
        0x7FFF /*maxItemsInObjectGraph*/, 
        false /*ignoreExtensionDataObject*/, 
        true /*preserveObjectReferences : this is where the magic happens */, 
        null /*dataContractSurrogate*/);

    serializer.WriteObject(Console.OpenStandardOutput(), group);
}

This one I understand completely. We have a root object, the array, referencing two unique persons. p1.Friend happens to be p2, so instead of serializing the same person by value twice, the serializer writes it out once and afterwards just stores an id that points to the copy it has already serialized.

However, have a look at this second example:

    static void Example2()
    {
        var p1 = new Person() { Name = "Diablo" };
        var p2 = new Person() { Name = "Mephesto" };

        p1.Friend = p2;

        var serializer = new DataContractSerializer(typeof(Person), null, 0x7FFF, false, true, null);

        serializer.WriteObject(Console.OpenStandardOutput(), p1);
        Console.WriteLine("\n");
        serializer.WriteObject(Console.OpenStandardOutput(), p2);
    }

Now, according to my understanding: when serializing p1 the serializer will serialize p1.Name and p1.Friend. In the second WriteObject, the serializer has already serialized p2 (which is p1.Friend) so it just serializes an id that points to p1.Friend instead of serializing it by value.

Running the code and viewing the output, that doesn't seem to be the case. In the second output we see the serializer serializing p2 by value, as if it hasn't come across it yet. That's the part I didn't get. It's as if there's an internal id counter that gets reset at the end of WriteObject.


Here's another similar example:

    static void Example3()
    {
        var p1 = new Person() { Name = "Diablo" };
        var p2 = p1;

        var serializer = new DataContractSerializer(typeof(Person), null, 0x7FFF, false, true, null);

        serializer.WriteObject(Console.OpenStandardOutput(), p1);
        Console.WriteLine("\n");
        serializer.WriteObject(Console.OpenStandardOutput(), p2);
    }

Again, the second output shows that we're serializing p2 as if we haven't encountered a definition for it yet.

Note that I didn't choose DataContractSerializer for any particular reason, any serializer that supports serializing by reference works.

I tried to ILSpy on DataContractSerializer but I got lost quickly and couldn't figure out much.

1. In Example2, why didn't the serializer store an id to p1.Friend when serializing p2? Is 'serializing by reference' only applied within a single object hierarchy, or how does it work in general?
2. It seems to me that serializing by reference will automatically handle circular references (A <-> B). Is that correct, or do I need to do other things to make sure I won't fall into an infinite loop?
3. I assume serializing by reference makes sense only when applied to reference types and not value types, correct?

I've tagged protobuf-net because it's similar in that it's a binary serializer and emits IL. I would love to hear how serializing by reference is implemented there :p

Side note: you don't have to use ILSpy, you can look at the code here - http://referencesource.microsoft.com/#System.Runtime.Serialization/System/Runtime/Serialization/DataContractSerializer.cs — Alexei Levenkov, May 01 '15 at 02:31

score 2 · Accepted Answer · answered May 01 '15 at 14:15


1. Each call to write-object is a separate serialization context; the reference-tracking is not preserved between calls.
2. As long as you correctly identify previously seen values, it shouldn't get recursive, but a depth check can help avoid issues.
3. Correct, although you could attempt to recognise semantically identical value types if you wanted (perhaps the structural equality interface).

Additional thought: if you apply this to strings, you might want to special-case as effective equality rather than reference equality - no point serialising two different instances (references) of the same string
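
To make points 1 and 2 concrete, here is a minimal sketch of per-call reference tracking. It is an illustration under stated assumptions, not the code of DataContractSerializer or protobuf-net: `SketchWriter`, `IdentityComparer`, and the `{id:…}` / `{ref:…}` text format are all hypothetical names invented for this example, and it emits text rather than binary so the output is easy to read.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Text;

public class Person
{
    public string Name;
    public Person Friend;
}

// Hypothetical helper: compares keys by object identity, ignoring any
// Equals/GetHashCode overrides the type might define.
public sealed class IdentityComparer : IEqualityComparer<object>
{
    bool IEqualityComparer<object>.Equals(object x, object y)
    {
        return ReferenceEquals(x, y);
    }

    int IEqualityComparer<object>.GetHashCode(object obj)
    {
        return RuntimeHelpers.GetHashCode(obj);
    }
}

// Hypothetical writer, for illustration only.
public static class SketchWriter
{
    // One call = one serialization context. The id table is created here and
    // discarded on return, so nothing is remembered between calls.
    public static string Write(Person root)
    {
        var ids = new Dictionary<object, int>(new IdentityComparer());
        var output = new StringBuilder();
        WritePerson(root, ids, output);
        return output.ToString();
    }

    private static void WritePerson(Person p, Dictionary<object, int> ids, StringBuilder output)
    {
        if (p == null)
        {
            output.Append("null");
            return;
        }

        // Already written in this call: emit only a back-reference.
        if (ids.TryGetValue(p, out int seen))
        {
            output.Append("{ref:" + seen + "}");
            return;
        }

        // Register the object BEFORE visiting its fields, so a cycle
        // (A <-> B) hits the back-reference branch instead of recursing forever.
        int id = ids.Count + 1;
        ids.Add(p, id);

        output.Append("{id:" + id + ",Name:" + p.Name + ",Friend:");
        WritePerson(p.Friend, ids, output);
        output.Append("}");
    }
}
```

With two persons A and B that are each other's `Friend`, `SketchWriter.Write(a)` returns `{id:1,Name:A,Friend:{id:2,Name:B,Friend:{ref:1}}}`, so the cycle terminates. A following `SketchWriter.Write(b)` starts from an empty table and writes B in full again as `{id:1,Name:B,Friend:{id:2,Name:A,Friend:{ref:1}}}`, which mirrors what Example2 and Example3 show. A real implementation would likely add the depth check mentioned in point 2 as a safety net.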


Marc Gravell


Thanks for the reply. On 1- Is this a standard in all serializers that implement serialize by ref? or only in DataContractSerializer? does this mean there's no way in example 2 and 3 to serialize the second object by reference? or that doesn't make sense and it's a part of my misunderstanding? – vexe May 01 '15 at 17:11

@vexe this is pretty much standard in all serializers; you *might* be able to find one that has an API for preserving such, but that is **not** the expected scenario; it *expects* that each call to `Serialize` / `Write` / whatever is unrelated and separate. In particular, if you call `Write(obj, target1)` and `Write(obj, target2)`, it is normally expected that `target1` and `target2` will both have **identical** contents (or at least: equivalent contents). That cannot be the case if the second time it just says "I've seen these objects before: I'll just serialize the ids" – Marc Gravell May 01 '15 at 17:13
I was thinking about using protobuf-net and compile a dll for a type with fields marked with [AsReference] (if I remember correctly) and then view the generated code. It should have the 'serialize by reference' logic baked into the dll correct? so I can maybe get a general idea how to go about it – vexe May 01 '15 at 18:26
@vexe the logic for there is in the core lib, not the generated IL. There's *basically* a reference-keyed dictionary – Marc Gravell May 01 '15 at 23:27
Thank you. Lastly, do you recommend any reading/books that go in-depth into the subject? All the stuff I read teaches you how to use existing serializers and not how they work internally/how to implement one yourself. – vexe May 02 '15 at 00:27
@vexe I don't know of any such; you can read the source for XmlSerializer, Json.net, protobuf-net, Jil, etc - though! – Marc Gravell May 02 '15 at 05:40
So I've implemented it and it's working fine :) could you elaborate a bit on your last 'additional thought' idea? what do you mean with 'effective equality'? I assume you mean char-by-char equality? isn't the normal `==` on strings enough for that or do I need some special compare/equality class? cause I looked at your source and I saw you just newing up that strings dictionary with no arguments so the default equality stuff applies. – vexe May 04 '15 at 17:06
@vexe yeah, regular string equality is fine; the point I was trying to make was: don't use `object.ReferenceEquals` for strings – Marc Gravell May 04 '15 at 17:12
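
The string point from the last two comments can be sketched as follows. This is a hedged illustration, not protobuf-net's actual code: `StringTable` and `GetOrAdd` are hypothetical names. It relies only on the fact that a `Dictionary<string, int>` created with no arguments compares keys by content (ordinal equality), whereas `object.ReferenceEquals` would treat two instances with the same characters as different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical per-call string table, for illustration only.
public sealed class StringTable
{
    // The default comparer for string keys compares character by character,
    // so no custom IEqualityComparer is needed here.
    private readonly Dictionary<string, int> ids = new Dictionary<string, int>();

    // Returns the id assigned to this string content. isNew is true only the
    // first time the content is seen, which is when the characters themselves
    // would have to be written. Null must be handled by the caller, because
    // Dictionary does not accept null keys.
    public int GetOrAdd(string value, out bool isNew)
    {
        if (ids.TryGetValue(value, out int id))
        {
            isNew = false;
            return id;
        }

        id = ids.Count + 1;
        ids.Add(value, id);
        isNew = true;
        return id;
    }
}
```

For example, given `string s1 = "Diablo"` and `string s2 = new string("Diablo".ToCharArray())`, `object.ReferenceEquals(s1, s2)` is false because they are two separate instances, yet `GetOrAdd(s1, …)` and `GetOrAdd(s2, …)` both return id 1, and only the first call reports `isNew` as true. The text is therefore written once and referenced afterwards.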
