Instead of BinaryFormatter serialization you could use http://code.google.com/p/protobuf-net/ and then compute a cryptographic hash of the output. protobuf-net is said to be more compact than BinaryFormatter (see for example http://code.google.com/p/protobuf-net/wiki/Performance ).
I'll add that, since you don't really need the serialized output itself, it could be better to use Reflection and "navigate" through the objects, computing your hash along the way (in the same way the various serializers "traverse" your object). See for example: Using reflection in C# to get properties of a nested object
After much thought, and after hearing what @Jon said, I can tell you that my "secondary" idea (using Reflection) is VERY VERY VERY difficult, unless you want to spend a week writing an object parser. Yes, it's doable... but what representation would you give the data before computing the hash? To be clear:
two strings
"A"
"B"
clearly ("A", "B") != ("AB", ""), but MD5("A" + "B") == MD5("AB" + ""): simply concatenating the values before hashing makes the two cases indistinguishable. Probably the best solution is to prepend each string's length (so using Pascal/BSTR notation).
And null values? What "serialized" value do they have? Another tough question. Clearly, if you serialize a string as length+data (to solve the previous problem), you could serialize null simply as "null" (with no length)... And what about objects? Would you prepend an object-type id? It would surely be better; otherwise variable-length objects could create the same ambiguity as strings.
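A minimal sketch of the length-prefix idea (the `FieldHasher` name and the null-marker byte are my choices, not a standard):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class FieldHasher
{
    // Feed each string as <length><utf8 bytes> so that ("A", "B")
    // and ("AB", "") produce different hash inputs.
    public static byte[] HashStrings(params string[] values)
    {
        using (var md5 = MD5.Create())
        {
            foreach (var s in values)
            {
                if (s == null)
                {
                    // A marker byte distinguishes null from the empty string.
                    md5.TransformBlock(new byte[] { 0xFF }, 0, 1, null, 0);
                    continue;
                }
                var data = Encoding.UTF8.GetBytes(s);
                var len = BitConverter.GetBytes(data.Length);
                md5.TransformBlock(len, 0, len.Length, null, 0);
                md5.TransformBlock(data, 0, data.Length, null, 0);
            }
            md5.TransformFinalBlock(new byte[0], 0, 0);
            return md5.Hash;
        }
    }
}
```

With this, `HashStrings("A", "B")` and `HashStrings("AB", "")` no longer collide, because the embedded lengths differ.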
Using BinaryFormatter (and probably even protobuf-net) you don't actually have to store the serialized object anywhere, because both support streaming... An example:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;
using System.Security.Cryptography;

public class Hasher : Stream
{
    protected readonly HashAlgorithm HashAlgorithm;

    protected Hasher(HashAlgorithm hash)
    {
        HashAlgorithm = hash;
    }

    public static byte[] GetHash(object obj, HashAlgorithm hash)
    {
        var hasher = new Hasher(hash);

        if (obj != null)
        {
            var bf = new BinaryFormatter();
            bf.Serialize(hasher, obj);
        }
        else
        {
            hasher.Flush();
        }

        return hasher.HashAlgorithm.Hash;
    }

    public override bool CanRead
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanSeek
    {
        get { throw new NotImplementedException(); }
    }

    public override bool CanWrite
    {
        get { return true; }
    }

    public override void Flush()
    {
        // Finalize the hash computation.
        HashAlgorithm.TransformFinalBlock(new byte[0], 0, 0);
    }

    public override long Length
    {
        get { throw new NotImplementedException(); }
    }

    public override long Position
    {
        get { throw new NotImplementedException(); }
        set { throw new NotImplementedException(); }
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        throw new NotImplementedException();
    }

    public override long Seek(long offset, SeekOrigin origin)
    {
        throw new NotImplementedException();
    }

    public override void SetLength(long value)
    {
        throw new NotImplementedException();
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Feed each serialized chunk into the hash as it is written.
        HashAlgorithm.TransformBlock(buffer, offset, count, buffer, offset);
    }
}

static void Main(string[] args)
{
    var list = new List<int>(100000000);
    for (int i = 0; i < list.Capacity; i++)
    {
        list.Add(0);
    }

    Stopwatch sw = Stopwatch.StartNew();
    var hash = Hasher.GetHash(list, new MD5CryptoServiceProvider());
    sw.Stop();

    Console.WriteLine(sw.ElapsedMilliseconds);
}
I define a Hasher class that receives the serialization of the object (one piece at a time) and computes the hash in "streaming mode". Memory use is O(1); time is clearly O(n) (with n the "size" of the serialized object).
If you want to use protobuf-net instead (but be aware that for complex objects it needs them to be marked with its attributes, or with WCF attributes, or...):
public static byte[] GetHash<T>(T obj, HashAlgorithm hash)
{
    var hasher = new Hasher(hash);

    if (obj != null)
    {
        ProtoBuf.Serializer.Serialize(hasher, obj);
    }

    hasher.Flush();
    return hasher.HashAlgorithm.Hash;
}
The only "big" differences are that protobuf-net doesn't Flush the stream, so we have to do it ourselves, and that it TRULY wants the root object to be typed, not a plain "object".
Oh... and for your question:
"How should I serialize the object? It must be fast and not consume too much memory. Also it must reliably always be serialized the same way. If I use the .NET default serialization can I really be sure that the created binary stream is always the same if the actual data is the same? I doubt it."
using System.Linq;

List<int> l1 = new List<int>();
byte[] bytes1, bytes2;

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes1 = ms.ToArray();
}

l1.Add(0);
l1.RemoveAt(0);

using (MemoryStream ms = new MemoryStream())
{
    new BinaryFormatter().Serialize(ms, l1);
    bytes2 = ms.ToArray();
}

Debug.Assert(bytes1.SequenceEqual(bytes2));
Let's say this: the Debug.Assert will fail. This is because List<T> "saves" some internal state (for example a version counter). This makes binary serialization very difficult to use for comparison. You would be better off using a "programmable" serializer (like protobuf-net): you tell it which properties/fields to serialize, and it serializes exactly those.
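To sketch what "programmable" means here (the protobuf-net attributes are real; the `Person` type is just an illustration of mine):

```csharp
using ProtoBuf;

[ProtoContract]
public class Person
{
    [ProtoMember(1)]
    public string Name { get; set; }

    [ProtoMember(2)]
    public int Age { get; set; }

    // Members without [ProtoMember] (caches, version counters, etc.)
    // are simply not serialized, so they can't perturb the hash.
    public int UnrelatedCounter { get; set; }
}
```

Only the explicitly tagged members reach the stream, so two logically equal objects serialize to the same bytes regardless of internal bookkeeping.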
So what would be an alternative way to serialize that doesn't take to long to implement?
protobuf-net... or DataContractSerializer (but it's quite slow). As you can imagine, there is no silver bullet for data serialization.
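For completeness, a hedged sketch of the DataContractSerializer route (the `DcsHash` name is mine; for brevity it buffers in memory instead of streaming through a Hasher-like stream):

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Security.Cryptography;

static class DcsHash
{
    // Serialize with DataContractSerializer, then hash the resulting bytes.
    public static byte[] GetHash<T>(T obj, HashAlgorithm hash)
    {
        using (var ms = new MemoryStream())
        {
            new DataContractSerializer(typeof(T)).WriteObject(ms, obj);
            return hash.ComputeHash(ms.ToArray());
        }
    }
}
```

It works out of the box for common collections like List&lt;int&gt; and for [DataContract] types, but as noted it is slower than the binary alternatives.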