23

I have the following two strings:

var string1 = "MHH2016-05-20MASTECH HOLDINGS, INC. Financialshttp://finance.yahoo.com/q/is?s=mhhEDGAR Online FinancialsHeadlines";

var string2 = "CVEO2016-06-22Civeo upgraded by Scotia Howard Weilhttp://finance.yahoo.com/q/ud?s=CVEOBriefing.comHeadlines";

At first glance these two strings are different however their hashcode is the same using the GetHashCode method.

var hash = 0;
var total = 0;
foreach (var x in string1) //string2
{
    //hash = x * 7;
    hash = x.GetHashCode();
    Console.WriteLine("Char: " +  x + " hash: " + hash + " hashed: " + (int) x);
    total += hash;
}

Total ends up being 620438779 for both strings. Is there another method that will return a more unique hash code? I need the hashcode to be unique based on the characters in the string. Although both strings are different and the code works properly, these two strings so happen add up to being the same. How can I improve this code to make them more unique?

live2
  • 3,771
  • 2
  • 37
  • 46
some random dude
  • 457
  • 1
  • 4
  • 11
  • 6
    You do realize, don't you, that you can't guarantee a unique hash code for all possible strings? A hash code is 32 bits, meaning that there are 4 billion (and change) possible values. Each of your two strings is more than 120 characters long. The number of possible 120-character strings using the 96 printable ASCII characters is is much, much larger. Collisions are inevitable. *There is no such thing as a unique hash code* in the general case. Making the hash code larger will reduce the chance of collision, but will not eliminate it. – Jim Mischel Jun 27 '16 at 00:41
  • 2
    Your question implies that you're trying to use hash codes as unique identifiers. This is an incredibly bad idea, and doomed to fail. The answer by @AlexD explains why. – Jim Mischel Jun 27 '16 at 00:46
  • @JimMischel yes I am aware of this now but thanks – some random dude Jun 27 '16 at 01:50
  • Possible duplicate of [Generate unique number based on string input in Javascript](https://stackoverflow.com/questions/15377161/generate-unique-number-based-on-string-input-in-javascript) – Codebeat Jan 28 '18 at 21:52
  • Old question, I know, see my 3 years earlier question and answer: https://stackoverflow.com/questions/15377161/generate-unique-number-based-on-string-input-in-javascript – Codebeat Jan 28 '18 at 21:53
  • Related: [What is hashCode used for? Is it unique?](https://stackoverflow.com/q/7425142). – dbc Jan 28 '18 at 23:26

2 Answers2

40

string.GetHashCode is indeed inappropriate for real hashing:

Warning

A hash code is intended for efficient insertion and lookup in collections that are based on a hash table. A hash code is not a permanent value. For this reason:

  • Do not serialize hash code values or store them in databases.
  • Do not use the hash code as the key to retrieve an object from a keyed collection.
  • Do not use the hash code instead of a value returned by a cryptographic hashing function. For cryptographic hashes, use a class derived from the System.Security.Cryptography.HashAlgorithm or System.Security.Cryptography.KeyedHashAlgorithm class.
  • Do not test for equality of hash codes to determine whether two objects are equal. (Unequal objects can have identical hash codes.) To test for equality, call the ReferenceEquals or Equals method.

and has high possibility of duplicates.

Consider HashAlgorithm.ComputeHash. The sample is slightly changed to use SHA256 instead of MD5, as @zaph suggested:

static string GetSha256Hash(SHA256 shaHash, string input)
{
    // Convert the input string to a byte array and compute the hash.
    byte[] data = shaHash.ComputeHash(Encoding.UTF8.GetBytes(input));

    // Create a new Stringbuilder to collect the bytes
    // and create a string.
    StringBuilder sBuilder = new StringBuilder();

    // Loop through each byte of the hashed data 
    // and format each one as a hexadecimal string.
    for (int i = 0; i < data.Length; i++)
    {
        sBuilder.Append(data[i].ToString("x2"));
    }

    // Return the hexadecimal string.
    return sBuilder.ToString();
}
Community
  • 1
  • 1
AlexD
  • 32,156
  • 3
  • 71
  • 65
  • for the complete example see https://msdn.microsoft.com/en-us/library/system.security.cryptography.md5(v=vs.110).aspx – lexx9999 Jun 26 '16 at 23:14
  • @lexx9999 I think the link in the post points to the same algorithm already. – AlexD Jun 26 '16 at 23:18
  • As I was reading it, it did not include GetMd5Hash/ VerifyMd5Hash – lexx9999 Jun 26 '16 at 23:23
  • This would result in slow performance, security hashing is not mean to be used for avoiding collision. – sm_ Jun 26 '16 at 23:26
  • @lexx9999 Ah, yeas, perhaps the post editing was in progress at that moment. – AlexD Jun 26 '16 at 23:29
  • 2
    @SirajMansour Cryptographic hashes are indeed designed for avoiding collisions. On my iPhone I can compute a SHA-256 hash on a 1MB file in 0.950 mSec, is that fast enough? BTW, SHA-256 is slightly faster than MD5 on my phone. – zaph Jun 27 '16 at 01:40
  • The chance of a collision using SHA-256 is 1 in 2^128 or 1 in 340282366900000000000000000000000000000000000000. As far as we know no collision has been seen in SHA-256. – zaph Jun 27 '16 at 02:04
7
using System.Security.Cryptography;
string data="test";
byte[] hash;
using (MD5 md5 = MD5.Create())
{
    md5.Initialize();
    md5.ComputeHash(Encoding.UTF8.GetBytes(data));
    hash = md5.Hash;
}

hash is a 16 byte array, which in turn you could covert to some hex-string or base64 encoded string for storage.

EDIT:

What's the purpose of that hash code?

From hash(x) != hash(y) you can derive x!=y, but

from hash(x) == hash(y) you canNOT derive x==y in general!

rohitt
  • 1,124
  • 6
  • 18
lexx9999
  • 736
  • 3
  • 9
  • 2
    This would result in slow performance, security hashing is not mean to be used for avoiding collision. – sm_ Jun 26 '16 at 23:26
  • @lexx9999 the purpose of the hash code is to differentiate from duplicates. The strings are returned from a web scraper. I have tried your suggested code for the two strings I used as an example in my question which returns unique hashcodes however when I try this on my overall program I still get duplicates. – some random dude Jun 26 '16 at 23:47
  • 1
    @somerandomdude, As with every hash function you would have to compare the original data, in case you want to be absolutely sure, You can try other hash-algorithms, but you must always expect collisions. That's what `from hash(x) == hash(y) you canNOT derive x==y in general!` means. – lexx9999 Jun 26 '16 at 23:58
  • Wasn't the answer I was hoping for but this makes sense. Thanks, I will look into other algorithms. – some random dude Jun 27 '16 at 00:17
  • 1
    MD5 should not be used, it is insecure, use a SHA-2 hash such as SHA-256. Generally SHA-256 is minimally slower than MD5. – zaph Jun 27 '16 at 01:36
  • @SirajMansour Cryptographic hashes are indeed designed for avoiding collisions. – zaph Jun 27 '16 at 01:45
  • 2
    @zaph they're designed for security purposes. Although it would serve the goal in avoiding collision doesn't mean its the right tool. – sm_ Jun 27 '16 at 02:16
  • I guess GIT also used the wrong tool and Linus did not know that. Cryptographic hashes were designed of collision resistance. And now are you going to tell us the right tool. – zaph Jun 27 '16 at 02:25
  • 1
    @zaph you think request routing to sharded datastores based on hashing objects would use a cryptographic hash ? you need to do some reading mate. I guess the people who designed non-crypto hash algos like Murmur,FNV, SuperFastHash are all mistaken, they should all take your word on this one. – sm_ Jun 27 '16 at 03:09
  • Perhaps you should read the question, mate. But why should we believe that any of Murmur, FNV or SuperFastHash are well vetted and collision resistant at least to the standard of SHA-2 functions? Maybe they are good, maybe they aren't, but we know a lot about SHA-2 functions. What is the point of using these functions, are they more collision resistant? Not likely since there are no known flaws in SHA-2 functions. Are they significantly faster? If this is the claim provide benchmarks. Otherwise the rant against cryptographic hash functions is just FUD. – zaph Jun 27 '16 at 03:54
  • 1
    I am not ranting against crypto hash functions or the fact that they're proven to work where they're supposed to. I am just saying that if the sole purpose is randomness, depending on the performance needs crypto hash-algos are not always the answer. Now you can argue back and be aggressive as you want, but try to accept other opinions. It doesn't hurt. – sm_ Jun 27 '16 at 03:59
  • The problem is the supposition of where they are supposed to be used. A key element is collision resistantancn, in fact that is why MD5 and SHA-1 are no longer recommended, that is collisions. – zaph Jun 27 '16 at 11:46