I am importing data from a file (xls, csv, xml) which will result in a complex in-memory object graph. Now I need to know whether this graph has been modified since it was exported. What would be a safe way to check this? I suppose I'd export a hashcode with the file? If so, would the standard way of generating an object's hashcode suffice? How should I generate the hash? I would prefer to generate the hash on the object graph rather than on the actual stream/file.
3 Answers
You can ensure that nobody changes your data by encrypting it or using a hashcode. With the text-based formats you mentioned, encryption would cost you the human-readability, so I think you would prefer hashcodes.
Whether standard hashing methods are enough depends heavily on what exactly you consider "safe": if you just want to make sure that there was no hardware error when storing/transferring the data, or to detect a simple change by someone who did not know what they were doing, that might be fine, provided you use a good GetHashCode() function. If you want to protect the data against "attackers", I wouldn't rely on a 32-bit "homemade" hash (especially if the "attacker" might know the code, e.g. in open-source projects).
In such cases I would prefer stronger hash functions like MD5 (not very collision-safe) or, better, SHA-2. These work on byte streams, so you have to hash the data itself (the XML etc.) or perhaps the .NET-serialized data (which makes the hash independent of your file's data format). .NET provides classes for these algorithms, see for example http://msdn.microsoft.com/de-de/library/system.security.cryptography.hmacsha256.aspx
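For illustration, here is a minimal sketch of hashing a byte stream with SHA-256 in .NET. The XML string is just a hypothetical stand-in for your serialized graph:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class GraphHashExample
{
    // Computes a SHA-256 hash over any byte stream, e.g. the serialized
    // form of the object graph or the raw XML/CSV content.
    static byte[] ComputeSha256(Stream data)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(data);
    }

    static void Main()
    {
        // Hypothetical stand-in for your serialized graph.
        byte[] exported = Encoding.UTF8.GetBytes("<root><item value=\"42\" /></root>");
        using (var stream = new MemoryStream(exported))
            Console.WriteLine(BitConverter.ToString(ComputeSha256(stream)));
    }
}
```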

- I simply want to detect a change to the actual data (a user added an item, changed a property value, etc.) in a quick and reliable way. So I wonder if this is reliable enough: http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode – bitbonk Mar 15 '11 at 13:54
- There is one catch with both of your questions: since anyone can change the data, the only way to check whether the data is the same is to either compare every bit of information one at a time, or calculate hashes of both sides and compare the hashes. And since calculating a hash on data that has probably changed means you have to read all the data anyway, what's the point of not comparing the data directly? – Daniel Mošmondor Mar 21 '11 at 21:18
- Because I do not have the original data. Once it has been exported, it can be deleted locally. When I import it later, I still need to be able to tell whether the data was changed. That's why the original hash will be saved WITH the exported file, so that later, upon import, I can use it to validate the data. – bitbonk Jul 18 '11 at 13:58
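For what it's worth, that workflow might look roughly like this, assuming the checksum is stored in a sidecar file next to the export (the ExportVerifier name and the .sha256 file convention are made up for this sketch):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class ExportVerifier
{
    static byte[] ComputeChecksum(byte[] data)
    {
        using (var sha = SHA256.Create())
            return sha.ComputeHash(data);
    }

    // On export: write the data and a checksum file next to it.
    public static void Export(string path, byte[] data)
    {
        File.WriteAllBytes(path, data);
        File.WriteAllBytes(path + ".sha256", ComputeChecksum(data));
    }

    // On import: recompute the checksum and compare to detect changes.
    public static bool IsUnmodified(string path)
    {
        byte[] expected = File.ReadAllBytes(path + ".sha256");
        byte[] actual = ComputeChecksum(File.ReadAllBytes(path));
        return expected.SequenceEqual(actual);
    }
}
```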
The standard solution for your problem isn't hashing the graph. Usually you just keep track of if/when a change occurred.
You could use a HasChanged flag, but I don't like that. I usually use a version counter which is incremented on every change. When saving to a file I store the current value of the version counter, and to check if something changed I compare the old version counter with the current one.
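A minimal sketch of that idea (the TrackedGraph class and its single property are hypothetical; in a real model every mutating member would bump the counter):

```csharp
// Hypothetical model: every mutating member bumps a version counter.
class TrackedGraph
{
    public int Version { get; private set; }

    private string _name;
    public string Name
    {
        get { return _name; }
        set { _name = value; Version++; } // each change increments the counter
    }
}

// Usage: remember the counter at save time, compare on the next check.
// var graph = new TrackedGraph();
// int savedVersion = graph.Version;   // persist alongside the file
// ...
// bool changed = graph.Version != savedVersion;
```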

- I cannot keep track of changes because they are made outside my software. The data gets exported (csv, XML, Excel), (possibly) edited, and then later imported again. – bitbonk Mar 20 '11 at 21:25
I ended up doing the following, which seems to work pretty well (a rough sketch follows the list):
1. Create a custom integer hashcode that includes all simple properties of a single object, using this algorithm.
2. Repeat step 1 for all complex objects that this object references.
3. Serialize all the integer hashcodes into one binary stream in a well-known order.
4. Create an MD5 checksum of this stream.
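A rough sketch of these steps, with hypothetical Customer/Order types standing in for the real graph and a prime-multiply combination in the spirit of the linked answer. One caveat: String.GetHashCode() is not guaranteed to be stable across runtime versions, so for a checksum that is persisted to disk you may want per-field hashing that you control.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical graph types; the real object model will differ.
class Order
{
    public decimal Amount;

    public int ComputeContentHash()
    {
        unchecked
        {
            return 17 * 23 + Amount.GetHashCode();
        }
    }
}

class Customer
{
    public int Id;
    public string Name;
    public Order[] Orders; // assumed non-null for brevity

    // Step 1: combine all simple properties with the prime-multiply scheme.
    public int ComputeContentHash()
    {
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + Id;
            hash = hash * 23 + (Name == null ? 0 : Name.GetHashCode());
            return hash;
        }
    }
}

static class GraphChecksum
{
    // Steps 2-4: gather hashcodes of the object and everything it
    // references, write them to one stream in a fixed order, then MD5 it.
    public static byte[] Compute(Customer customer)
    {
        using (var stream = new MemoryStream())
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(customer.ComputeContentHash());
            foreach (var order in customer.Orders) // well-known order
                writer.Write(order.ComputeContentHash());
            writer.Flush();

            stream.Position = 0;
            using (var md5 = MD5.Create())
                return md5.ComputeHash(stream);
        }
    }
}
```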