0

I need my app to handle a list of mods from a database and a list of locally downloaded mods that aren't in the database. Each database mod has a unique uint ID that I use to identify it, but local mods don't have any ID.

At first I tried to generate an ID by calling string.GetHashCode() on the mod's name, but GetHashCode is randomized at each run of the app. Is there any other way to generate a persistent uint ID from the mod's name?

Current code:

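foreach (string mod in localMods)
{
    // This gives a number between 0 and 2147483647, but it differs between runs.
    // Note: Math.Abs throws an OverflowException if GetHashCode() returns int.MinValue.
    uint newId = Convert.ToUInt32(Math.Abs(mod.GetHashCode()));
    ProfileMod newMod = new ProfileMod(newId);
}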

Keelah
  • 4
    Use any hash function (MD5 etc..) you like. But be aware that there may be collisions. – Klaus Gütter Aug 27 '20 at 12:27
  • "GetHashCode is still randomized at each run of the app": no. Then you basically have a serious code problem. Hash codes should not change between app runs. – TomTom Aug 27 '20 at 12:31
  • 1
    @TomTom Actually, the documentation for [`object.GetHashCode()`](https://learn.microsoft.com/en-us/dotnet/api/system.object.gethashcode?view=netcore-3.1) explicitly states that `In some cases, hash codes may be computed on a per-process or per-application domain basis.` So the value returned from a call to `object.GetHashCode()` could well change between runs. – Matthew Watson Aug 27 '20 at 12:36
  • Collisions should not really be a problem since it only handles a small number of mods and an even smaller number of local mods. @TomTom That's what I thought as well... I was using `Convert.ToUInt32(Math.Abs(mod.GetHashCode()))` and it sometimes differed between two runs – Keelah Aug 27 '20 at 12:37
  • In your example of `mod.GetHashCode()`, what is the type of `mod`? `string.GetHashCode()` generally *does* return the same value between runs (but is not guaranteed to do so, and you must never rely on that) – Matthew Watson Aug 27 '20 at 12:38
  • mod is a string (I'll edit my question) – Keelah Aug 27 '20 at 12:45
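
As an illustration of the "use any hash function" suggestion above, here is a minimal sketch of a deterministic, non-cryptographic option: 32-bit FNV-1a over the UTF-8 bytes of the name. The class and method names (`Fnv1a`, `Hash`) are hypothetical and not from the thread. It always gives the same value for the same string, but collisions are still possible with a 32-bit output.

```csharp
using System.Text;

// Hypothetical helper, not part of the original question.
public static class Fnv1a
{
    // 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
    public static uint Hash(string text)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            // The multiplication is meant to wrap around at 2^32.
            unchecked
            {
                hash ^= b;
                hash *= prime;
            }
        }
        return hash;
    }
}
```

In the loop from the question, this would be used as `uint newId = Fnv1a.Hash(mod);`.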

3 Answers

5

The method GetHashCode() is not guaranteed to return the same value for the same string once you re-run the application. It serves a different purpose (for example, placing keys in hash tables and quickly ruling out equality at runtime).
So, it shouldn't be used as a persistent unique identifier.

If you'd like to calculate the hash and get consistent results, you might consider using the standard hashing algorithms like MD5, SHA256, etc. Here is a sample that calculates SHA256:

using System;
using System.Security.Cryptography;
using System.Text;

public class Program
{
    public static void Main()
    {
        string input = "Hello World!";
        // Using the SHA256 algorithm for the hash.
        // NOTE: You can replace it with any other algorithm (e.g. MD5) if you need.
        using (var hashAlgorithm = SHA256.Create())
        {
            // Convert the input string to a byte array and compute the hash.
            byte[] data = hashAlgorithm.ComputeHash(Encoding.UTF8.GetBytes(input));

            // Create a new StringBuilder to collect the bytes
            // and create a string.
            var sBuilder = new StringBuilder();

            // Loop through each byte of the hashed data
            // and format each one as a hexadecimal string.
            for (int i = 0; i < data.Length; i++)
            {
                sBuilder.Append(data[i].ToString("x2"));
            }

            // Build the final hexadecimal string.
            var hash = sBuilder.ToString();

            Console.WriteLine($"The SHA256 hash of {input} is: {hash}.");
        }
    }
}

Although SHA256 produces a longer result than MD5, the risk of collisions is much lower. If you still want smaller hashes (with a higher risk of collisions), you can use MD5, or even CRC32.

P.S. The sample code is based on the one from Microsoft's documentation.
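
Since the question asks for a uint rather than a hex string, here is a rough sketch of one way to cut the SHA256 digest down to 32 bits by reading its first four bytes. The names `StableId` and `FromName` are made up for illustration, and `BinaryPrimitives` assumes .NET Core 2.1 or later (or the System.Memory package). Truncating to 32 bits keeps the result repeatable but gives up most of the collision resistance.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class StableId
{
    // Derives a repeatable 32-bit ID from a name using the first 4 bytes of SHA256.
    public static uint FromName(string name)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));

        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(name));

            // Read with a fixed byte order so the value does not depend
            // on the endianness of the machine.
            return BinaryPrimitives.ReadUInt32LittleEndian(digest);
        }
    }
}
```

Calling `StableId.FromName(mod)` for each local mod name should then give the same uint on every run.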

Just Shadow
1

So I ended up following your advice and found a good answer in another post that uses SHA-1:

using System;
using System.Security.Cryptography;
using System.Text;

// SHA1.Create() replaces SHA1CryptoServiceProvider, which is obsolete in newer .NET versions.
// Note: a shared HashAlgorithm instance is not thread-safe.
private readonly SHA1 hash = SHA1.Create();

private uint GetUInt32HashCode(string strText)
{
    if (string.IsNullOrEmpty(strText)) return 0;

    // Encoding.Unicode is UTF-16 (little-endian), which covers every character.
    byte[] byteContents   = Encoding.Unicode.GetBytes(strText);
    byte[] hashText       = hash.ComputeHash(byteContents); // SHA-1 yields 20 bytes
    // Fold three 4-byte slices of the digest into one 32-bit value.
    uint   hashCodeStart  = BitConverter.ToUInt32(hashText, 0);
    uint   hashCodeMedium = BitConverter.ToUInt32(hashText, 8);
    uint   hashCodeEnd    = BitConverter.ToUInt32(hashText, 16);
    uint   hashCode       = hashCodeStart ^ hashCodeMedium ^ hashCodeEnd;
    return uint.MaxValue - hashCode;
}

Could probably be optimized but it's good enough for now.

Keelah
  • 1
    I do not think it is necessary to XOR the different parts. A good hash function should spread the entropy evenly over the output, so just taking the first 4 bytes should be sufficient. Also keep in mind that even the best hash function has a fairly high risk of collision with a 32 bit output value. – JonasH Aug 27 '20 at 15:10
  • 1
    Yeah, having a uint as a final result is risky in terms of collisions. Also XORing the parts of the hash might need to be revisited, as by doing math it might turn out that that last operation causes even more collisions. @JonasH, in general it would be better if you store the whole hash (in a string or byte array) instead of uint. – Just Shadow Aug 28 '20 at 08:11
  • 1
    @Just Shadow, obviously keeping the whole hash would be best, but may be difficult for other reasons. So there is a trade-off between ease of use and avoiding collisions. I do not know the exact problem domain well enough to judge what is more important. – JonasH Aug 28 '20 at 08:24
  • Alright. I'll revise the code, but so far the hash-to-uint approach is my only solution. Users tend to have at most around 200 items, so collisions should be rare. Also, all database IDs are between 0 and 2000000 so far, so I still have some room for error. – Keelah Aug 31 '20 at 10:52
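
To put rough numbers on the collision discussion above, here is a small sketch with two hypothetical helpers (`LocalIdRange`, `CollisionProbability` and `ToLocalId` are names invented for this example). The first uses the usual birthday approximation and assumes the hash output is uniformly distributed. The second forces the top bit on so a hashed local ID can never land in the database range, assuming database IDs stay below 2^31 (the thread only says they are below 2000000 so far).

```csharp
using System;

// Hypothetical helpers, not part of the original answer.
public static class LocalIdRange
{
    // Birthday approximation: chance that at least two of `count` uniformly
    // random values from a space of `spaceSize` values are equal.
    public static double CollisionProbability(int count, double spaceSize)
    {
        double pairs = (double)count * (count - 1) / 2.0;
        return 1.0 - Math.Exp(-pairs / spaceSize);
    }

    // Sets the highest bit, so the result is always >= 2^31.
    // This leaves 31 bits of the hash to tell local mods apart.
    public static uint ToLocalId(uint hash)
    {
        return hash | 0x80000000u;
    }
}
```

For 200 names, `LocalIdRange.CollisionProbability(200, Math.Pow(2, 31))` comes out on the order of 10^-5, so a clash between two local mods is unlikely but not impossible; checking for duplicates when the IDs are assigned would still be prudent.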
1

I wouldn't trust any solution involving hashing or the like. Eventually you will end up with conflicting IDs, especially if you have a huge number of records in your DB.

What I would prefer to do is convert the int ID from the DB to a string when reading it, and then use a function like Guid.NewGuid().ToString() to generate a string UID for the local ones.

This way you will not have any conflict at all.

I guess you will have to employ some strategy of this kind.

  • Issue is, I need a specific uint to handle stuff. Otherwise, the API won't work anymore – Keelah Aug 31 '20 at 10:50
  • Then, what I would do is start my local uint IDs at UInt32.MaxValue and reduce the value by 1 for each new local one. Of course, this relies on the DB IDs not exceeding UInt32.MaxValue / 2, since otherwise there might be a conflict. – Efthymios Kalyviotis Aug 31 '20 at 15:12
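
The counting-down idea from the last comment could look roughly like the sketch below. `LocalIdAllocator` and `GetOrAdd` are hypothetical names. The sketch only keeps the name-to-ID map in memory; for the IDs to persist between runs, the map would have to be saved and reloaded (for example, in a settings file), which is left out here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical allocator, not part of the original answer.
public sealed class LocalIdAllocator
{
    private readonly Dictionary<string, uint> idsByName = new Dictionary<string, uint>();

    // Next free ID, counting down from the top of the uint range.
    private uint nextId = uint.MaxValue;

    // Returns the ID already assigned to this name, or assigns the next one.
    public uint GetOrAdd(string modName)
    {
        if (modName == null) throw new ArgumentNullException(nameof(modName));

        if (idsByName.TryGetValue(modName, out uint existing))
        {
            return existing;
        }

        uint id = nextId;
        idsByName[modName] = id;
        nextId--;
        return id;
    }
}
```

Unlike hashing, this cannot produce a collision between two local mods; the trade-off is that the ID depends on the order in which mods were first seen rather than on the name alone.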