35

I want to do some tests which require some strings with the same hash code, but not the same strings. I couldn't find any examples, so I decided to write a simple program to do it for me.

The code below generates two random strings over and over until they generate the same hash code.

    static Random r = new Random();
    static void Main(string[] args)
    {
        string str1, str2;
        do
        {
            str1 = GenerateString();
            str2 = GenerateString();
        } while (str1.GetHashCode() != str2.GetHashCode() && str1 != str2);

        Console.WriteLine("{0}\n{1}", str1, str2);
    }

    static string GenerateString()
    {
        string s = "";
        while (s.Length < 6)
        {
            s += (char)r.Next(char.MaxValue);
        }
        return s;
    }

This code seems to be working (theoretically), but it may take centuries to complete. So I was thinking of doing vice versa and generate two strings from one hash code.

I know it's not possible to retrieve a string from a hash code, but is it possible to generate possible strings from it?

I'm using Visual Studio 2015 Community Edition. Version: 14.0.23107.0D14REL.

.NET Framework: 4.6.00081.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
M.kazem Akhgary
  • 18,645
  • 8
  • 57
  • 118
  • 2
    If you want just for a test, you can create a class and override `GetHashCode` with a bad implementation that returns a fixed value, effectively producing collisions everywhere. – Alejandro Aug 15 '15 at 17:25
  • It might be possible if you reverse engineer the implementation of `GetHashCode` using Reflector or some similar tool. Do note that the implementation has changed from one .NET framework version to the next and since you didn't mention what version you are using it would not be possible for anyone to provide an answer as to what specifically to do to achieve this. – NightOwl888 Aug 15 '15 at 17:28
  • just read the code for [string.GetHashCode](http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4) – Hamid Pourjam Aug 15 '15 at 17:30
  • You could generate 4.3 billion different strings, one of them must by definition have the same hash :) – DavidG Aug 15 '15 at 17:31
  • 10
    It takes just a few seconds, the Birthday Paradox makes it quick. "1241308".GetHashCode() == "699391".GetHashCode() in 32-bit code, "942".GetHashCode() == "9331582".GetHashCode() in 64-bit code. – Hans Passant Aug 15 '15 at 17:34
  • @Alejandro unfortunately the program im going to test is working with `GetHashCode` – M.kazem Akhgary Aug 15 '15 at 17:34
  • @M.kazemAkhgary - The version of Visual Studio is irrelevant, since each version of Visual Studio supports multiple versions of .NET. It is the version of the .NET framework your project is based on that is needed. – NightOwl888 Aug 15 '15 at 17:37
  • wow. thanks now i have an example! i dont know about Birthday Paradox but ill look into it. @HansPassant – M.kazem Akhgary Aug 15 '15 at 17:37
  • 4
    Taking advantage of the birthday paradox is done by storing all of the strings you've generated so far. Each time you generate a new one you check it against all of the others. It increases the chances of finding a match enormously. If there are N hash codes then your method will have a 50% chance of finding a collision after checking N/2 strings. The birthday method will have a 50% chance after checking √N strings. For 32-bit hash codes where N=~4 billion, this is the difference between checking 2 billion strings or 65 thousand of them. – John Kugelman Aug 15 '15 at 17:45
  • @JohnKugelman That maths probably only applies if the collision rate is consistently spread over all possible hash values. – DavidG Aug 15 '15 at 17:48
  • @DavidG yes but it should be pointed out that John's numbers are the worst case numbers. For a non consistent spread the chances of finding a collision only go up. – Scott Chamberlain Aug 15 '15 at 18:01
  • What exactly are you trying to do? You can't rely on two strings having the same hash, because the hash function can be randomized in certain circumstances (just look at the [string.GetHashCode implementation](http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4)) – Arturo Torres Sánchez Aug 15 '15 at 21:25
  • For an example of not 2, but 3 strings with same md5 hash, see http://natmchugh.blogspot.com/2014/11/three-way-md5-collision.html . Actually, they're images--Brown, Black, and White. Also see the project called 'HashClash'. – toddkaufmann Aug 15 '15 at 22:59
  • 1
    @ScottChamberlain: That's not true. The math only applies if the hash-values are approximately uniformly-random. If they are not, you can easily need many more hashes to find a collision _(ex. `hash(x) = x%INT_MAX`, you'd iterate over `INT_MAX` values before finding a collision)_. – BlueRaja - Danny Pflughoeft Aug 15 '15 at 23:06
  • @BlueRaja-DannyPflughoeft You are correct, I was thinking incorrectly. – Scott Chamberlain Aug 16 '15 at 00:32
  • 1
    If you are using 64 bits, you can just use `"\0Anything"` and `"\0Really, anything will work."`- [Why is String.GetHashCode() implemented differently in 32-bit and 64-bit versions of the CLR?](http://stackoverflow.com/q/6813263/7586). Of course, it is better to look for these strings for you current implementation - `\0` will not always work. – Kobi Aug 16 '15 at 05:16
  • 1
    The most stylish way to do with would be to use an SMT Solver to find a hash code collision. – usr Aug 16 '15 at 10:33

2 Answers2

38

Finding two strings by repeatedly comparing random strings will take practically forever. Instead generate strings and store them in an a dictionary by hashcode. Then look up each. Match found pretty quickly.

MATCH FOUND!! xqzrbn and krumld hash code 80425224

void Main()
{

    var lookup = new Dictionary<int,string>();

    while(true) {
        var s = RandomString();        
        var h = s.GetHashCode();
        string s2;
        if (lookup.TryGetValue(h, out s2) && s2 != s) {
            Console.WriteLine("MATCH FOUND!! {0} and {1} hash code {2}",
                lookup[h],
                s,
                h);
            break;
        }
        lookup[h] = s;

        if (lookup.Count % 1000 == 0) {
            Console.WriteLine(lookup.Count);
        }
    }
}

static Random r = new Random();

// Define other methods and classes here
static string RandomString() {

    var s = ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString() +
            ((char)r.Next((int)'a',((int)'z')+1)).ToString();

    return s;
}
Kobi
  • 135,331
  • 41
  • 252
  • 292
Samuel Neff
  • 73,278
  • 17
  • 138
  • 182
  • Thanks for the answer. @ScottChamberlain and SamuelNeff. it seems the method generating the string also affects the chances! more characters reduces the chance to get same hashcode. – M.kazem Akhgary Aug 15 '15 at 17:56
  • 1
    @M.kazemAkhgary That's not true, You feel that way because the alpha method uses a lot more CPU power dus taking longer to find a match. the number of items should be the same. – Behrooz Aug 15 '15 at 20:18
  • @ScottChamberlain and SamuelNeff, Both of you should use SortedDictionary for maximum performance here. Dictionary will crwal after few hundred thausand items. – Behrooz Aug 15 '15 at 20:19
  • @Behrooz Are you kidding, a `SortedDictionary` has `O(log n)` retrieval. A Dictionary as `O(1)` retrieval. Especially with a `int` key (which has a `1:1` value to hashcode ratio because the implmentation is simply [`GetHashCode() { return m_value; }`](http://referencesource.microsoft.com/mscorlib/R/2d2b91f1bdc71dbb.html)) that is the absolute best case situation for a dictonary lookup. – Scott Chamberlain Aug 15 '15 at 21:14
  • 1
    @Behrooz The only way a dictionary can "crawl" is if you store lots of values that have the exact same HashCode but `Equals(` returns `false`. Using a `int` it is impossible to have `(x.GetHashCode() == y.GetHashCode()) && (x.Equals(y) == false)` – Scott Chamberlain Aug 15 '15 at 21:25
  • @ScottChamberlain Ok, I did a benchmark and I was so very wrong. Thank You. One problem remains. how does a dictionary have O(1) without taking up 2^36 bytes of memory? what's the magic? – Behrooz Aug 15 '15 at 23:12
  • If anyone else is curious and wants to check http://paste.debian.net/292621/ the 1st argument to the program is the number of kilo items and the second is a bitmask for string GetHashCode(), value of 0 makes things interesting. – Behrooz Aug 15 '15 at 23:48
  • @Behrooz You can look at the source code [here](http://referencesource.microsoft.com/#mscorlib/system/collections/generic/dictionary.cs). Basicly it has a [`Entry[] entries`](http://referencesource.microsoft.com/#mscorlib/system/collections/generic/dictionary.cs,6d8e35702d74cf71) and a `int[] buckets`. The code does `buckets[key.GetHashCode() % buckets.Count]` to find the first index in the entires array to go in to, after that it behaves as a singularly linked list jumping through all entries that sharede a bucket. Once `count == entries.Count` it resizes the arrays and re-calculates buckets. – Scott Chamberlain Aug 15 '15 at 23:59
  • 1
    @Behrooz See [this old answer of mine](http://stackoverflow.com/questions/28593764/gethashcode-method-with-dictionary-and-hashset/28593852#28593852) where I go in to the process in more detail than a comment will allow. – Scott Chamberlain Aug 16 '15 at 00:02
  • 1
    @Behrooz note O(1) does not mean fast, it means constant time. It can be constantly slow and still be O(1). That's not to say a dictionary is slow, just saying in general. Performance of a dictionary depends on the number of buckets and distribution of keys across buckets (randomness of hashcode). – Samuel Neff Aug 16 '15 at 00:29
  • Dictionary has *amortized* O(1). The cost of resizing the internal bucket array when a threshold count is reached is spread out over all the other additions. – Lasse V. Karlsen Aug 16 '15 at 10:23
  • @ScottChamberlain A dictionary becomes slow if there are many entries in the same bucket, even if they have distinct hash codes. This can happen with integer keys (though it won't happen here). – CodesInChaos Aug 16 '15 at 12:24
30

Take advantage of the Birthday Paradox. Instead of only testing two strings directly, test all strings you have seen before.

using System;
using System.Collections.Generic;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            var words = new Dictionary<int, string>();

            int i = 0;
            string teststring;
            while (true)
            {
                i++;
                teststring = i.ToString();
                try
                {
                    words.Add(teststring.GetHashCode(), teststring);
                }
                catch (Exception)
                {
                    break;
                }
            }

            var collisionHash = teststring.GetHashCode();
            Console.WriteLine("\"{0}\" and \"{1}\" have the same hash code {2}", words[collisionHash], teststring, collisionHash);
            Console.ReadLine();
        }
    }
}

For my machine it produces the output

"699391" and "1241308" have the same hash code -1612916492

almost instantly.

Due to how strings are hashed in .NET you may not get the exact same output as me, but it should be just as fast.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
  • When running this on .NET Core, the hash code changes every time the application runs, so any hash codes posted in this question will be different than your application. This behavior is described [here](https://andrewlock.net/why-is-string-gethashcode-different-each-time-i-run-my-program-in-net-core/). If you need two strings with the same hash code for testing, as I did, you will need to generate them every time your application runs. Fortunately this takes less than 100 ms. Or you can create a new class with a bad implementation of GetHashCode, as Alejandro suggested. – Jeremiah Mercier Mar 31 '20 at 17:41
  • 1
    Yes, that is why I put *"Due to how strings are hashed in .NET you may not get the exact same output as me, but it should be just as fast."* – Scott Chamberlain Apr 01 '20 at 01:23