20

This question is related to this one, but I think it should be asked separately.

I have a complex graph of object instances. Now I would like to create a checksum on this object graph directly in memory to detect whether changes have been made to it since the last time the checksum was saved with the object graph. The checksum calculation should be quick and should not consume too much memory.

As I understand it now, the best solution would probably be to generate a cryptographic hash of a binary-serialized form of the object graph (correct me if I am wrong). But that comes with a few questions:

  1. How should I serialize the object? It must be fast and not consume too much memory. Also it must reliably always be serialized the same way. If I use the .NET default serialization can I really be sure that the created binary stream is always the same if the actual data is the same? I doubt it.
  2. So what would be an alternative way to serialize that doesn't take too long to implement?

Update:

What do you think about this approach:

  1. navigate through the graph and for each object in the graph create a standard int hashcode using this algorithm (but exclude reference type members representing nodes in the graph). Add each hashcode to an integer list
  2. convert the integer list to a byte array
  3. create a hash on the byte array using MD5, CRC or similar

The GetHashCode algorithm mentioned should quickly calculate a hashcode that is fairly collision-safe for a single object, taking only its primitive members into account. Based on this, the byte array should also be a fairly collision-safe representation of the object graph, and so should the MD5/CRC hash computed from it.
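For illustration, here is a minimal sketch of this approach. The Node type, its Children collection and the hand-rolled GetHashCode are made up for the example; HashSet<T> needs .NET 3.5, so on 2.0 a Dictionary<Node, bool> would do the same job.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

public class Node
{
    public int Id;
    public string Name;
    public List<Node> Children = new List<Node>();

    // Hash of the primitive members only; child nodes are deliberately excluded.
    public override int GetHashCode()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + Id;
            hash = hash * 23 + (Name != null ? Name.GetHashCode() : 0);
            return hash;
        }
    }
}

public static class GraphChecksum
{
    public static byte[] Compute(Node root)
    {
        var hashes = new List<int>();
        Visit(root, new HashSet<Node>(), hashes);

        // Convert the integer list to a byte array and hash that.
        var bytes = new byte[hashes.Count * sizeof(int)];
        Buffer.BlockCopy(hashes.ToArray(), 0, bytes, 0, bytes.Length);
        using (var md5 = MD5.Create())
            return md5.ComputeHash(bytes);
    }

    private static void Visit(Node node, HashSet<Node> visited, List<int> hashes)
    {
        if (node == null || !visited.Add(node)) return; // guard against cycles
        hashes.Add(node.GetHashCode());
        foreach (var child in node.Children)
            Visit(child, visited, hashes);
    }
}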

bitbonk
  • A checksum that doesn't "consume too much memory" is not guaranteed to detect whether changes have been made. If you can live with some very rare false negatives (i.e., same checksum but object graph is actually different), then it may be okay. – Justin Mar 18 '11 at 17:25
  • You'll probably have better luck if you ask these as separate questions. – Pat Mar 18 '11 at 17:47
  • @Justin: Yes a checksum doesn't but serializing big object graphs to a binary stream does. – bitbonk Mar 19 '11 at 08:31
  • @Justin Your RAM and CPU working without random errors isn't guaranteed either. And for any decent checksum (for example SHA-1) the chance of the computer making a random error is larger than the chance of a collision. – CodesInChaos Mar 19 '11 at 10:13
  • Serializing the whole graph (to have higher quality input for your hash function) and making it run fast are conflicting requirements. You should specify what matters more to you and a ballpark estimate for how large your graph and its serialized representation would be. – Jon Mar 21 '11 at 13:18
  • It is *slightly* more important that the hashing algorithm is not too expensive than that it is safe. I am aware of the two conflicting requirements and was hoping to find a way to serialize the graph that is as cheap and compact as possible but still produces a fairly high-quality input. – bitbonk Mar 22 '11 at 07:42
  • @Justin: If your checksum/hash has fewer bits than the object you are hashing, it is highly likely there are two different object graphs that produce the same hash code. That doesn't change the utility of a checksum as a cheap "different" check. – Ira Baxter Mar 25 '11 at 23:52
  • @bitbonk, do you want to check the actual objects' contents for changes or the graph itself? – Andrei Sosnin Mar 27 '11 at 15:13
  • For the hash itself, CRC32 is the fastest, but is also the least safe, AFAIK. If the performance matters so much, I'd stick with that. Otherwise hashes don't use much memory themselves. MD5 needs no more than 1 kB of memory to operate nicely -- it can be used on streams just as well. Read up on the MD5 design on Wikipedia to understand why; it's really simple at a high level. – Andrei Sosnin Mar 27 '11 at 15:20

5 Answers

9

Instead of binary serialization you could use http://code.google.com/p/protobuf-net/ and then calculate a crypto hash of it. protobuf is said to be more compact than binary serialization (see for example http://code.google.com/p/protobuf-net/wiki/Performance ).

I'll add that, thinking about it, you don't really need to serialize at all. It would be better to use Reflection and "navigate" through the objects, calculating your hash (in the same way the various serializers "traverse" your object). See for example Using reflection in C# to get properties of a nested object

After much thought, and hearing what @Jon said, I can tell you that my "secondary" idea (using Reflection) is VERY VERY VERY difficult, unless you want to spend a week on writing an object parser. Yes, it's doable... But what representation would you give to the data before calculating the Hash? To be clear:

two strings
"A"
"B"

clearly "A", "B" != "AB", "". But MD5("A") combined with MD5("B") == MD5("AB") combined with MD5(""). Probably the best is to prepend the length (so using Pascal/BSTR notation)

And null values? What "serialized" value do they have? Another tough question. Clearly if you serialize a string as length+string (to solve the previous problem), you could serialize null simply as "null" (no length)... And objects? Would you prepend an object type id? That would surely be better; otherwise variable-length objects could make the same mess as strings.
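To make that concrete, a small sketch of such a length-prefixed, type-tagged encoding (the WriteString helper is made up for the example; it only shows how ("A", "B") and ("AB", "") end up as different byte streams and how null stays distinguishable from the empty string):

using System.IO;
using System.Text;

static void WriteString(BinaryWriter w, string s)
{
    if (s == null)
    {
        w.Write((byte)0);              // tag: null
    }
    else
    {
        w.Write((byte)1);              // tag: non-null string
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        w.Write(bytes.Length);         // length prefix (Pascal/BSTR style)
        w.Write(bytes);
    }
}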

Using BinaryFormatter (or probably even protobuf-net) you don't actually have to store the serialized object anywhere, because they both support streaming... An example:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;

// A write-only Stream that forwards everything written to it into a HashAlgorithm,
// so the hash is computed in streaming mode without buffering the serialized data.
public class Hasher : Stream
{
    protected readonly HashAlgorithm HashAlgorithm;

    protected Hasher(HashAlgorithm hash)
    {
        HashAlgorithm = hash;
    }

    public static byte[] GetHash(object obj, HashAlgorithm hash)
    {
        var hasher = new Hasher(hash);

        if (obj != null)
        {
            var bf = new BinaryFormatter();
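            // BinaryFormatter flushes the stream when it finishes serializing
            // (unlike protobuf-net, see below), so our Flush override runs and
            // the hash is finalized here.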
            bf.Serialize(hasher, obj);
        }
        else
        {
            hasher.Flush();
        }

        return hasher.HashAlgorithm.Hash;
    }

    public override bool CanRead
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanSeek
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanWrite
    {
        get { return true; }
    }

    public override void Flush()
    {
        HashAlgorithm.TransformFinalBlock(new byte[0], 0, 0);
    }

    public override long Length
    {
        get { throw new NotImplementedException(); }
    }

    public override long Position
    {
        get
        {
            throw new NotImplementedException();
        }
        set
        {
            throw new NotImplementedException();
        }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        HashAlgorithm.TransformBlock(buffer, offset, count, buffer, offset);
    }
}

static void Main(string[] args)
{
    var list = new List<int>(100000000);

    for (int i = 0; i < list.Capacity; i++)
    {
        list.Add(0);
    }

    Stopwatch sw = Stopwatch.StartNew();
    var hash = Hasher.GetHash(list, new MD5CryptoServiceProvider());
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds);
}

I define a Hasher class that receives the serialization of the object (a piece at a time) and calcs the hash in "streaming mode". The memory use is O(1). The time is clearly O(n) (with n the "size" of the serialized object).

If you want to use protobuf-net instead (but be aware that for complex objects it needs them to be marked with its attributes, or with WCF attributes, etc.):

public static byte[] GetHash<T>(T obj, HashAlgorithm hash)
{
    var hasher = new Hasher(hash);

    if (obj != null)
    {
        ProtoBuf.Serializer.Serialize(hasher, obj);
        hasher.Flush();
    }
    else
    {
        hasher.Flush();
    }

    return hasher.HashAlgorithm.Hash;
}

The only "big" differences are that protobuf doesn't Flush the stream, so we have to do it, and that it TRULY wants that the root object be typed and not a simple "object".

Oh... and for your question:

How should I serialize the object? It must be fast and not consume too much memory. Also it must reliably always be serialized the same way. If I use the .NET default serialization can I really be sure that the created binary stream is always the same if the actual data is the same? I doubt it.

List<int> l1 = new List<int>();

byte[] bytes1, bytes2;

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes1 = ms.ToArray();
}

l1.Add(0);
l1.RemoveAt(0);

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes2 = ms.ToArray();
}

Debug.Assert(bytes1.Length == bytes2.Length);

Let's put it this way: the Debug.Assert will fail. This is because List<T> "saves" some internal state (for example a version field, and the grown capacity of its backing array). That makes it very difficult to binary-serialize and compare. You would be better off using a "programmable" serializer (like protobuf-net): you tell it which properties/fields to serialize and it serializes only those.

So what would be an alternative way to serialize that doesn't take to long to implement?

Protobuf-net... or DataContractSerializer (but it's quite slow). As you can imagine, there isn't a silver bullet for data serialization.

xanatos
    Reflection wouldn't qualify as "quick" though. – Jon Mar 21 '11 at 13:12
  • @Jon You can "cache" the object properties discovered "once" with Reflection for later use. I don't think it's much slower... And in the end I think the various serializers do it in this way. – xanatos Mar 21 '11 at 13:18
  • @Jon OR you can use Expressions. They seems to be faster http://stefan.rusek.org/Posts/LINQ-Expressions-as-Fast-Reflection-Invoke/3/ You "explore" the object with Reflection and then create some Expressions to save them. – xanatos Mar 21 '11 at 13:21
  • @xanatos: All I 'm saying is that a naive implementation with Reflection won't work satisfactorily, and that a considerable amount of work would be needed to make it so. Not that it's not doable. – Jon Mar 21 '11 at 13:27
  • @Jon yep... Probably at least a week of work if you want to make it good... I've looked around and I haven't found any "pre-made" object parser... And I'm not even sure how you should break apart the fields... Nulls and variable length objects give me the fits ("A", "B" != "AB", "") – xanatos Mar 21 '11 at 14:36
  • +1, I was going to suggest something along the same lines, but your solution is better than what I had in mind... However it wouldn't hurt to implement the CanRead/CanSeek properties by just returning false, and throw NotSupportedException rather than NotImplementedException... – Thomas Levesque Mar 21 '11 at 21:07
  • @Thomas I've left the code as given by Visual Studio. This isn't a complete solution, but it's "working". I have tested it with simple objects. – xanatos Mar 21 '11 at 21:19
  • @xanatos: What do you think about my "simplistic" approach? (See **Update:** in my question) – bitbonk Mar 22 '11 at 07:36
  • @bitbonk I don't think the problem is in USING the GetHashCode. I think the problem(s) are: A) Being sure that all the classes implement it CORRECTLY (a good/resistant hash code) and B) Traversing all the children (being sure of not traversing the same child twice, otherwise you could create an infinite loop where A references B and B references A). In the last step of your idea, it's useless to use an MD5 or a CRC32. You can simply reuse your "standard" formula for GetHashCode. And I see Skeet already suggested that you NOT use GetHashCode! – xanatos Mar 22 '11 at 08:26
  • @xanatos: I can make sure A) and B) are fulfilled because I will be implementing it (for each object) myself. I understood that @JohnSkeet suggested not to use this algorithm for calculating a hashcode for a complex object graph, and I am not: I would use this algorithm just for the object itself (just the simple properties without references). This should be safe. The mighty @JohnSkeet also says: *"It's fast and creates a pretty good hash which is unlikely to cause collisions"* – bitbonk Mar 22 '11 at 09:02
  • @bitbonk The problem with a 32-bit hashcode is the "birthday problem", read here http://en.wikipedia.org/wiki/Birthday_problem Your hash is perhaps even less resistant than a CRC32. For 1-3 int properties, its values are probably 24 bits or less (let's say (a, b, c) = (1, 2, 3): the hash is 207417, so you are using 18 bits). A good hash function should be uniform ( http://en.wikipedia.org/wiki/Hash_function#Uniformity ). The hash suggested by Skeet is good for a Dictionary/Hashtable where there is a "Plan B" for collisions. You should AT LEAST use CRC32. – xanatos Mar 22 '11 at 09:11
  • @xanatos: I would use integer GetHashCode for a single object (there shouldn't usually be collisions between any two objects) and I would use CRC32 for the **list** of integers (converted to a big byte stream). – bitbonk Mar 22 '11 at 09:15
  • @bitbonk My point is that using CRC32 as the last step is quite useless. You could simply GetHashCode the array of GetHashCodes. Your weak link isn't in the last step. – xanatos Mar 22 '11 at 09:19
  • That doesn't make sense to me. It is pretty safe to assume that the big byte stream created from the list of integer hashcodes is very nearly unique. If at least one object changes one of its properties, or at least one object gets removed or added, the byte stream will differ. Collisions are *extremely* rare. If I create a CRC32 hashcode on that byte stream I will be safe. However if I only create an integer hashcode on that big byte stream, collisions are much likelier to happen. – bitbonk Mar 22 '11 at 09:28
  • @bitbonk To calc the birthday problem chances you can use this: double duplicate = 1 - (Math.Exp(-Math.Pow(items, 2.0) / (2.0 * hashSpace))); It's 0-1 (with 1 = 100%). hashSpace is how many values your hash can have. items is how many items you are calculating your hash on at a time (consider only items of the same type). (taken from http://sites.google.com/site/craigandera/craigs-stuff/odds-ends/the-birthday-problem-calculator ) How big is the graph you are trying to hash? – xanatos Mar 22 '11 at 09:37
  • It should be safe to assume that there will never be more than 3000 objects. – bitbonk Mar 22 '11 at 10:00
  • @bitbonk If we consider the hash function suggested to give 24 bits of distribution, you have a 24% chance of a collision :-) If instead you consider the full 32 bits, there is only a 0.1% chance. That's still quite a lot. But if you consider that you don't have 3000 objects of the same type, but that they are "subdivided" into various types, the chances go down. – xanatos Mar 22 '11 at 10:09
  • @bitbonk In the end I suggest you at least try MD5 + protobuf and see if it's fast enough. The damage caused by a failed "fast" hash is probably greater than the time lost to MD5-ing everything. – xanatos Mar 22 '11 at 10:17
  • You are probably right about this. Although I would have preferred not to add dependencies on 3rd-party libraries. Also I am currently on .NET 2.0 which prevents me from using DataContractSerializer. – bitbonk Mar 22 '11 at 10:40
  • @bitbonk The 2.0 part is pretty big, you know? You should have specified it quite clearly. You should add the tag `.net-2.0` to your questions. This rules out quite a few things. – xanatos Mar 22 '11 at 10:49
  • Well, I was also interested in reading about how I could do it if I were up to date :) – bitbonk Mar 22 '11 at 10:56
7

What do you think about this approach:

  • navigate through the graph and foreach object in the graph create a standard int hashcode using this algorithm (but exclude reference type members representing nodes in the graph).
  • Add each hashcode to a integer list
  • Convert the integer list to a byte array
  • Create a hash on the byte array using MD5, CRC or similar

This approach is quite close to what I'd consider best, but it could use some polishing.

Hashing

Considering that you would prefer speed over accuracy and that an int-sized hashcode for each item leaves plenty of room for avoiding collisions, the choice of hashcode algorithm seems right. Excluding reference types that participate in the graph means we're throwing some information away; see below for more on that.

Improving the node hash

The idea of not taking into account other nodes connected to the node we are hashing is correct, but maybe we can do better than simply throwing all that information away? We don't want to take the hashcodes of other nodes into account (they will be hashed themselves as well), but we are throwing away the information provided by the graph edges here: the hashcode for a node with internal data X connected to N other nodes should not be the same as for a node with data X connected to M other nodes.

If you have a cheap way of taking part of the edge data into account, use it. For example, if the graph is directed then you can add to the hashcode computed for each node the number of edges going out from it to other nodes.

Aggregating hashcodes

Creating a list of hashcodes would be the middle ground between summing the hashcodes into one long (very fast, and keeps some additional information over summing into an int) and creating a list of hashcodes dependent on a total order of the items in the graph. If you expect lots of items in the graph then summing might be more appropriate (I'd try that first and see if it's collision-free enough); if the graph doesn't have many items (say < 1000) then I'd try the total-order approach first. Remember to allocate enough memory for the list (or simply use an array) when creating it; you already know its final length, so that's a free speed increase.
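As a rough sketch of the two aggregation options (nodeHashes and nodeCount are placeholders standing for the per-node hashcodes and the already-known node count):

// Option 1: order-insensitive aggregation, O(1) memory.
long sum = 0;
unchecked
{
    foreach (int h in nodeHashes)
        sum += h;                       // overflow is fine for a checksum
}

// Option 2: keep the individual hashes for the total-order approach.
// Preallocate, since the final count is already known.
var hashes = new List<int>(nodeCount);
foreach (int h in nodeHashes)
    hashes.Add(h);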

Producing a fixed-size hash

If you have summed the hashcodes into a primitive, this step is not required at all. Otherwise, hashing the list as a byte[] is what I'd consider best. Since hashing the bytes will take very little time in comparison to creating the list, you may want to use a larger-sized hash function than MD5 or CRC32 to reduce collisions without a practical performance hit.

Improving the final hash quality

After getting this "final" hash, I'd prepend or append to it the number of items in the hashed graph as a fixed-size, hex-encoded string because:

  • It might help in reducing collisions (how much depends on the nature of the graphs)
  • We already know the number of items in the graph (we just hashed each one of them) so it's an O(1) operation

Defining a total order

If the order in which the items in the graph are processed is not strictly defined, then the door is open for false negatives: two graphs which should hash to the same value do not, because even though they are logically equivalent, the implementation of the hash function chose to process the per-item hashes in a different order. This problem will appear only if you use a list; since addition is commutative, the "add into a long" approach is immune to it.

To combat that, you need to process the nodes in the graph in a well-defined order. That might be an order that's easy to produce from the data structure of the nodes (e.g. like preorder traversal on a tree) and/or other information (e.g. class names or node types for each node, node ids if such exist etc).

Since preprocessing the graph to produce a total order is going to take some time, you may want to weigh that against the cost incurred by a false negative result as I mentioned above. Also, if the graphs are large enough then this discussion might be moot because of the node hashcode summation approach being more suited to your needs.

Jon
4

Here's the approach I use:

1. Serialize using ServiceStack's TypeSerializer

This serializes objects to JSV, which I'd vaguely describe as "JSON with fewer quotes", so it's smaller and purported (by the author) to be about 5x faster than JSON serialization. The main advantage over BinaryFormatter and protobuf (which would otherwise have been my first choice) is that you don't have to go around annotating or describing all the types you want to serialize. I'm lazy like that, and this just works with any POCO.

2. Compute MD5 hash

This is a "good enough" approach for me in terms of performance and collision characteristics. If I were too improve upon it, I'd likely go with MurmurHash3, which has similar collision characteristics as MD5 but is much faster. It is not suitable for cryptographic purposes but it sounds like that is not a requirement here. The only reason I've gone with MD5 is it's baked in the BCL and it's fast enough for my purposes.

Here's the whole thing as an extension method:

using System.Text;
using System.Security.Cryptography;
using ServiceStack.Text;

public static byte[] GenerateHash(this object obj) {
    var s = TypeSerializer.SerializeToString(obj);
    using (var md5 = MD5.Create())                       // dispose the MD5 instance
        return md5.ComputeHash(Encoding.UTF8.GetBytes(s));
}

The objects I'm using with this are relatively small (typically no more than a couple hundred characters serialized) and I've never run into collision issues. YMMV.
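For what it's worth, change detection with this then boils down to comparing two hashes. A hypothetical usage sketch (myGraph is a placeholder object; SequenceEqual needs System.Linq):

byte[] before = myGraph.GenerateHash();
// ... mutate the object graph ...
byte[] after = myGraph.GenerateHash();
bool changed = !before.SequenceEqual(after);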

Todd Menier
  • Serializing to string? That implies I could create a JSON string and get an MD5 hash from that. Is that correct? – IAbstract Dec 18 '21 at 23:49
2

I think what you want is to generate a canonical order for the objects, sort the objects into that order, and then compute a hash on the objects in that sorted order.

One way to do that is to define a relation between objects which compares them as "<" or ">" whenever they do not contain identical content, and as "==" when they do [note: this doesn't account for the fact that the arcs from identical-content objects might allow you to distinguish them as "<" or ">"; if this matters to you, define a canonical order on arcs, too]. Now, enumerate all the objects in the graph, and sort by this relation. Process the objects in the sorted order, and compose their hashes.

I'd expect this to run really fast, certainly much faster than any solution involving serialization, because it isn't generating giant text (or even binary) strings from values.
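A minimal sketch of that idea, assuming the graph's objects have already been enumerated into a list. The ordering key (type name first, then a content-based GetHashCode as tie-breaker) is only an illustration, and it works as a canonical order only if GetHashCode depends on content alone:

using System;
using System.Collections.Generic;
using System.Security.Cryptography;

static byte[] HashInCanonicalOrder(List<object> objects)
{
    // Sort into a canonical order so logically equal graphs are processed
    // identically, regardless of how they happen to be laid out in memory.
    objects.Sort((a, b) =>
    {
        int byType = string.CompareOrdinal(a.GetType().FullName, b.GetType().FullName);
        return byType != 0 ? byType : a.GetHashCode().CompareTo(b.GetHashCode());
    });

    using (var md5 = MD5.Create())
    {
        foreach (object o in objects)
        {
            byte[] block = BitConverter.GetBytes(o.GetHashCode());
            md5.TransformBlock(block, 0, block.Length, null, 0);
        }
        md5.TransformFinalBlock(new byte[0], 0, 0);
        return md5.Hash;
    }
}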

Ira Baxter
  • To me this seems pretty close to what I was up to in the update of my question. Now the question is whether the simple hashcode algorithm for a single obect I was about to use is sufficient and how I would compose the hashes. Would I just write all the hashes to a BinaryWriter and then create a MD5 (or similar) on that stream? – bitbonk Mar 25 '11 at 16:08
  • It hardly matters, if what you are trying to do is to create a checksum. In this case, all you care about is that most of the time it will easily detect a change; in this case, I'd probably simply add the 64-bit checksums and ignore overflows. – Ira Baxter Mar 25 '11 at 17:06
  • ... If you want a cryptographic-quality signature, you might need something like you suggested (and in fact the current crypto guys will complain about you using MD5, since somebody found a hard-to-exercise hole in it). I'm not the right person to answer that question, but surely the problem of composing cryptographic-quality hashes of subobjects into a whole has been explored. – Ira Baxter Mar 25 '11 at 17:06
  • Let me note that your update is close, but misses the crucial *ordering* idea. If you don't do that, another identical (isomorphic) structure laid out in memory differently (e.g., during a different execution) will produce a different hash, and I don't think that's what you want. If you are willing to accept "just" a hash, and use a hash-combining scheme which is insensitive to order ("add hashes" is one), then you can get away without sorting; otherwise you can't. – Ira Baxter Mar 25 '11 at 17:08
  • How would you do that with heterogeneous objects? – xanatos Mar 25 '11 at 18:10
  • @xanatos: What's the problem? Defining an arbitrary canonical order relation is easy. If you have two objects of different types A and B, then the A object is arbitrarily less than the B object (use the string name of the object type to compare, for example). If they are the same type, then compare their fields until you find one field less than the corresponding field in the other. – Ira Baxter Mar 25 '11 at 18:16
  • Ah ok, you sort them based on FullName. Yeah, right. I hadn't thought of it. My mind is still locked in the "you can't compare apples and pears" mode :-) – xanatos Mar 25 '11 at 18:17
  • @xanatos: Well, whether apples are better than pears is literally a matter of taste :-} All that matters is that you decide consistently. – Ira Baxter Mar 27 '11 at 07:00
0

As Ira Baxter noted, you want to re-arrange (sort) the objects in the graph in a specific canonical order. Then you can go down to calculating hashes and reducing (as in 'map-reduce') them to a single hash.

As a performance trick, it is also sometimes good to try to keep the graph in that order all the time -- it's sometimes easier to keep a collection sorted than to sort it all after an update transaction.

This is where you can use a trick to minimise memory and CPU usage. You need to analyse, how often do objects and the graph change and how often do you want to know, if the graph of objects has changed.

As I've mentioned in the comment to your question, MD5 and similar hash algorithms don't use much memory -- less than a kilobyte per instance. You only need to keep a block (512 bytes) of data to be hashed at a time.

If you're lucky, your objects and graph will change a lot (i.e. many objects changing state one after another), but you only want to know about that once in a while (i.e. only after the whole graph-updating transaction is over). In this case you want to calculate the hashes only after the transaction is over. Or maybe only on demand (i.e. when you push an update event or poll for changes from a separate thread). In this case, to save memory, you want to feed the MD5/SHAxxx hash-calculating object a stream of data chunks, keeping as few intermediate values as possible. This way your memory usage will be constant, independent (as in O(1)) of the graph size.
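A sketch of that chunked feeding (GetNextChunk is a made-up placeholder for whatever produces the serialized graph piece by piece; returning null signals the end):

using (var md5 = MD5.Create())
{
    byte[] chunk;
    while ((chunk = GetNextChunk()) != null)            // e.g. 512-byte blocks
        md5.TransformBlock(chunk, 0, chunk.Length, null, 0);
    md5.TransformFinalBlock(new byte[0], 0, 0);
    byte[] graphHash = md5.Hash;                        // memory use stays O(1)
}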

Now, if you're even luckier, your objects don't change much, if at all, but you want to know right away if they've changed, for example by raising an event for each change. In this case you want to modify, i.e. wrap or otherwise extend, the objects to calculate a hash or simply to check them for actual changes. Push a 'changed' event in every property setter of an object. Do the same for graph changes as well. This will save you from calculating hashes altogether (a massive performance gain in some cases).
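For example, a property setter that signals changes instead of relying on hashing might look like this (TrackedNode and the Changed event are made-up names for illustration):

using System;

public class TrackedNode
{
    public event EventHandler Changed;

    private string _name;
    public string Name
    {
        get { return _name; }
        set
        {
            if (_name == value) return;   // no actual change, no event
            _name = value;
            EventHandler handler = Changed;
            if (handler != null) handler(this, EventArgs.Empty);
        }
    }
}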

If your objects change rarely and you also need to check them for changes rarely (that includes cases with de/serialization used somewhere in the process), then the first approach still works best.

It's generally counterproductive to try to calculate hashes for complex objects in a graph that changes often to know about every change happening inside immediately (to act upon every one of them). In this case you want to make some kind of a change signalling approach with events (best for .NET) or callbacks.

Andrei Sosnin
  • Sorting ~3000 objects on demand isn't expensive, especially considering it will only be done when a checksum is needed, which I would expect to be rare. – Ira Baxter Mar 27 '11 at 17:07
  • @Ira Baxter There's no mention about ~3000 items in that graph, so I wouldn't be so sure. How do you know, what kind of graph is bitbonk referring to? Why couldn't it be a big continent's detailed map of roads or a really big social network's connections graph, for example? Or did I miss that detail somewhere in the comments? – Andrei Sosnin Mar 28 '11 at 08:44
  • Yes, there is. See bitbonk comment on xanatos answer. – Ira Baxter Mar 28 '11 at 09:51