1

How probable is a hash collision of two strings in C#? I know that for objects in general, two unequal object are not guaranteed to have unequal hash codes, but how does this behave when the objects are strings.

I specifically need a function from a URL string to a unique key, but don't need anything fancy, its just to cache stuff from the web, skip download if a certain URL was already loaded.

Edit

What if I define a function like this

string UniqueKey (string url) {

    var list = SplitStringInHalf (url);
    var firstHalf = list[0].GetHashCode();
    var secondHalf = list[1].GetHashCode();

    return firstHalf.ToString() + secondHalf.ToString();
}
Cristian Garcia
  • 9,630
  • 6
  • 54
  • 75
  • That depends on the hash algorithm. Take one with low collision probability, like sha256. – Wiktor Zychla Sep 27 '14 at 00:58
  • 1
    @WiktorZychla, he did not mention it, but I think he is talking about GetHashCode(). Cristian please specify which hash function you are using. – BlueTrin Sep 27 '14 at 00:59
  • It looks for me that the question is not related to C# as the hash function is an algorithmic construction, not a language-specific feature. Also it is off-topic for the stack overflow as it is not about the coding. Consider asking on http://programmers.stackexchange.com/ or http://crypto.stackexchange.com/ – Alex Netkachov Sep 27 '14 at 00:59
  • 2
    A hash by definition cannot be unique. – Peter Ritchie Sep 27 '14 at 01:01
  • See http://stackoverflow.com/questions/2733541/what-is-the-best-way-to-implement-this-composite-gethashcode – Peter Ritchie Sep 27 '14 at 01:03
  • "its just to cache stuff from the web, skip download if a certain URL was already loaded." But for the web pages that you have downloaded, do you save the URL as well as the data? In that case the hash code does not have to be unique, you can use a normal HashTable or Dictionary collection that handles the lack of uniqueness and finds the correct data. – RenniePet Sep 27 '14 at 01:12
  • @RinniePet The I need a unique name for every file I save. They are all URL of files. – Cristian Garcia Sep 27 '14 at 01:14
  • Well maybe I don't understand your problem, but can you save the files as "File1", "File2", etc., and have a Dictionary that converts a key (the URL) to the filename. Then the standard string.GetHashCode and the Dictionary's collision handling mechanism provides a fast conversion. (string.GetHashCode() is used behind the scenes by Dictionary - you don't need to worry about it.) – RenniePet Sep 27 '14 at 01:21
  • Why would the method in your edit have any benefit over using the url itself as a key? – Preston Guillot Sep 27 '14 at 01:26

2 Answers2

0

The shorter the strings the higher the probability, here's a good link to calculating that probability:

http://preshing.com/20110504/hash-collision-probabilities/

I think this might answer your question, too:

How many random elements before MD5 produces collisions?

Community
  • 1
  • 1
0

For GetHashCode(), you can refer to this question on Stack Overflow. It will tell you that for small strings it is obviously higher.

In general, independently of your hash method as long as it is sensible, for relatively long strings, the chance is fairly low between 2 strings, but as you have many strings the curse of probability will make it much higher (for example when you add one more person to a group the chance that two people in the group have the same birthday is much higher).

As a general rule, you should not rely on it being unique, you can use it to differentiate as a primary key but then you need to make sure that two strings which have the same hashcode are different if you use it to sort them.

For example, you can use the hashcode to create a hash table, the key will not be unique, but you can do a proper comparison only when you have a collision which simplifies the task of comparing when you have a large number of elements.

Community
  • 1
  • 1
BlueTrin
  • 9,610
  • 12
  • 49
  • 78