
Hello SO community, I need some hashing expertise.

Some context:

I need to compare two lists of objects. One instance of the list lives in a .NET Core 2.0 application and the other in a .NET Framework 4.5.2 application.

To compare the two lists (ensure that they are the same in both applications) I would like to calculate a hash of the list and exchange that between the apps. To calculate a hash of the list I'm using the approach described in this answer.

This requires a hash code for each object in the list, and this is where my problem comes in: it seems to be a well-known fact (1) that .NET Core randomizes the hash codes of strings. To calculate the hash code of my objects, I need to include the hash codes of strings.

Leading to my question: Is it a good idea to calculate a hash of a string using SHA256?

The reasons why I would like to use it:

  • Fixed output size (part of the 256-bit digest can be interpreted as an Int32)
  • SHA should always produce the same output (also for future .Net/core versions)
  • It is easier to share between the applications than a custom implementation

Are there better (more efficient, easier, fewer collisions) alternatives?

MrWombat
  • Have you considered an MD5 Checksum? – Leo Dec 14 '18 at 15:03
  • Bear in mind that a hash/checksum will help you establish if two objects are definitely different, but there is always the possibility that different objects result in the same hash/checksum. – Leo Dec 14 '18 at 15:07
  • I know about collisions but this is not super critical here. MD5 is a good suggestion thank you! – MrWombat Dec 14 '18 at 15:19

1 Answer


A cryptographic hash function is likely unnecessary here: it has a significant computational cost and brings very little benefit for this kind of hash code. A basic hash function built from addition and multiplication would be enough; see "What is the best algorithm for an overridden System.Object.GetHashCode?" for an example of a good hash function for arrays or multiple fields (similar to the one you linked in the question). The requirements placed on cryptographic hash functions are far stronger than what is usually needed to put values in hash tables or to do basic inequality checks.
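As an illustration of the "addition and multiplication" idea, here is a minimal sketch of a string hash that does not rely on string.GetHashCode(), so it should give the same value on .NET Core and .NET Framework. The class name StableHash, the seeds 17 and 19, and the multiplier 31 are arbitrary choices made for this example, not something taken from the linked answer, and the list combination is assumed to be order-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class StableHash
{
    // Deterministic 32-bit hash of a string. It walks the UTF-16 code units
    // directly, so it is independent of the runtime's string.GetHashCode().
    public static int OfString(string s)
    {
        unchecked // let the arithmetic wrap around instead of throwing
        {
            int hash = 17;
            if (s == null) return hash;
            foreach (char c in s)
                hash = hash * 31 + c;
            return hash;
        }
    }

    // Order-sensitive combination of the per-item hashes.
    public static int OfList(IEnumerable<string> items)
    {
        unchecked
        {
            int hash = 19;
            foreach (string item in items)
                hash = hash * 31 + OfString(item);
            return hash;
        }
    }
}
```

Both applications would need to share exactly the same constants and iteration order; for objects with several fields, the per-field hashes can be combined in the same way as the list items.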

Notes:

  • SHA256 gives you 8 times more bits than a regular GetHashCode needs (256 vs. 32). To get any meaningful benefit, you'd need to update the rest of the code to work with 256-bit hash codes.
  • SHA256 (and other crypto algorithms) work on byte arrays, so you'd have to convert strings to byte arrays to compute the hash, making it even slower. At that point you may consider serializing the whole data structure into a byte array and computing SHA256 once.
  • Unless you have a limited set of values for which you can find a perfect hash function, you always have to deal with the possibility of collisions: equal hash codes do not mean equal values. Longer hash values decrease the chance of a collision (see the birthday problem), so you may want to adjust the hash code length to your needs.
  • If you need to make your hashing scheme publicly known, SHA256 is an easy method to describe... but you need to be very careful to explain how strings are converted to byte arrays (encoding is one important part) and whether any normalization is needed before that (including String.Normalize).
  • Consider some other mechanisms, such as versioning of the data or immutable data structures, to implement whatever you are looking for.
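If SHA256 is used anyway, the notes about serializing once and pinning down the encoding could be sketched roughly as follows. The class name ListDigest, the choice of UTF-8, NFC normalization, and the little-endian length prefix are all assumptions made for this example; whatever is chosen has to be identical in both applications. Items are assumed to be non-null strings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Text;

public static class ListDigest
{
    // Serializes the whole list into one byte array and hashes it once.
    public static string Sha256Hex(IEnumerable<string> items)
    {
        using (var buffer = new MemoryStream())
        using (var sha = SHA256.Create())
        {
            foreach (string item in items)
            {
                // Assumed convention: NFC-normalized text encoded as UTF-8.
                byte[] bytes = Encoding.UTF8.GetBytes(
                    item.Normalize(NormalizationForm.FormC));

                // Length prefix (forced to little-endian) so that
                // ["ab", "c"] and ["a", "bc"] serialize differently.
                byte[] length = BitConverter.GetBytes(bytes.Length);
                if (!BitConverter.IsLittleEndian) Array.Reverse(length);

                buffer.Write(length, 0, length.Length);
                buffer.Write(bytes, 0, bytes.Length);
            }

            byte[] digest = sha.ComputeHash(buffer.ToArray());
            return BitConverter.ToString(digest).Replace("-", "");
        }
    }
}
```

The hex string can be exchanged between the applications as-is; truncating it to 32 bits would discard most of the collision resistance that motivates using SHA256 in the first place.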

Alexei Levenkov