1

Question

What's the best way to implement a function that given an object is able to return a hash key?

The requirements would be:

  • HashCodeFn(((bool?)false, "example")) != HashCodeFn(((bool?)null, "example"))
  • Is relatively cheap to calculate
  • Works with any type without any specific requirement (e.g. [Serializable] attribute)

What I've tried

I've tried with .GetHashCode but:

  • It's unreliable for things like null vs 0 vs false
  • Needs to be implemented for every type

I've tried with:

    private static int GetHashKey<T>(T input)
    {
        using var memoryStream = new MemoryStream();
        BinaryFormatter formatter = new BinaryFormatter();
        formatter.Serialize(memoryStream, input);
        memoryStream.Position = 0;
        using var reader = new StreamReader(memoryStream);
        return reader.ReadToEnd().GetHashCode();
    }

but:

  • It requires that all types in the object tree implement [Serializable] (some types I have no control over and don't implement those)

I'm thinking of serialising the object to JSON in the most compact form and then get the GetHashCode of that string, but I'm not sure how well it works with something like NodaTime.Instant. Is that the fastest way to accomplish this?


Specific use case

This is used as a data loader key (see github.com/graphql/dataloader for an example) if that is of any help to understand the use case.

Specifically a data loader key is used to handle batching. When you have many requests with input (a, b, c) and you want to "pivot" on, for example, a (which means that (1, b, c), (2, b, c), (3, b, c) should call a batch function fn([1, 2, 3], (b, c)) then you need to be able to define a key that is the same for the same values of (b, c) to use as the data loader key.

From the input perspective, specifying or not a bool on something like b, for example, is considered to be 2 different things and should be batched on two different functions.

If I were to use (b, c).GetHashCode() then I it would consider ((bool?)false, "ok") and ((bool?)null, "ok") to be the same thing, therefore batching them to the same batch function yielding unexpected results.

Shoe Diamente
  • 723
  • 1
  • 5
  • 24
  • Does this answer your question? [What is the best algorithm for overriding GetHashCode?](https://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-overriding-gethashcode) – Pavel Anikhouski Jan 16 '20 at 16:35
  • @PavelAnikhouski Doesn't seem so because it relies on `.GetHashCode` for inner fields which doesn't fit the requirements. For example `0.GetHashCode()` and `false.GetHashCode()` yield the same value. – Shoe Diamente Jan 16 '20 at 16:40
  • What do you mean GetHashCode is "unreliable for things like null vs 0 vs false"? –  Jan 16 '20 at 16:44
  • For example `0.GetHashCode()` and `false.GetHashCode()` yield the same value and therefore don't satisfy the first requirement. – Shoe Diamente Jan 16 '20 at 16:45
  • @ShoeDiamente So? Why is that a problem? –  Jan 16 '20 at 16:45
  • @Amy It doesn't satisfy the first requirement specified in the question. – Shoe Diamente Jan 16 '20 at 16:45
  • 1
    @ShoeDiamente I am asking why that is a requirement. Why is `0` hashing to the same thing as `false` a problem? –  Jan 16 '20 at 16:45
  • @Amy Because I need to generate the same key for "equal" (in this definition `null`, `0` and `false` are all not equal) objects and different keys (with a certain collision tolerance) for different objects. This is used as a data loader key (see https://github.com/graphql/dataloader for an example) if that is of any help to understand the use case. – Shoe Diamente Jan 16 '20 at 16:50
  • @Amy Specifically a data loader key is used to handle batching. When you have many requests with input `(a, b, c)` and you want to "pivot" on, for example, `a` (which means that `(1, b, c), (2, b, c), (3, b, c)` should call a batch function `fn([1, 2, 3], (b, c)`) then you need to be able to define a key that is the same for the same values of `(b, c)` to use as the data loader key. From the input perspective, specifying or not a bool on something like `b`, for example, is considered to be 2 different things and should be batched on two different functions. – Shoe Diamente Jan 16 '20 at 16:54
  • @Amy If I were to use `(b, c).GetHashCode()` then I it would consider `((bool?)false, "ok")` and `((bool?)null, "ok")` to be the same thing, therefore batching them to the same batch function yielding unexpected results. – Shoe Diamente Jan 16 '20 at 16:57
  • @ShoeDiamente I feel that that information should be in the question. My instinct was that you were overthinking hashing. I don't think you can achieve what you want in the context of general objects. For a specific type, sure, you could hash `0` XOR'd with a prime, and hash `false` with a different prime. That said, you could take a look at [this answer](https://stackoverflow.com/a/56539595/47589) –  Jan 16 '20 at 16:59
  • @ShoeDiamente Would a hash function that combines the type's hash and the value's hash work for you? That would allow `0` to hash differently than `false`... – Kit Jan 16 '20 at 17:18

1 Answers1

1

I don't think there's any particularly efficient way to do what you want. Some sort of additional processing will be required to make sure you're getting appropriate hash codes. Also, keep in mind that if the classes you don't control already implement Equals and GetHashCode and Equals returns true for instance that differ only by something like a nullable boolean being false or null, then it's incorrect for GetHashCode to return different values.

You could serialise to JSON to achieve what you want. That will exclude any fields that might happen to be annotated for exclusion. Assuming none of the fields relevant to a hash code are excluded then that'll work. Alternatively you could write extension functions for the types that are going to cause clashes and customise the hashing for those fields. Then use reflection (which is likely to be used in serialising to JSON as well) to iterate over class members and get hash codes using your extensions where necessary. Something along the lines of the code below.

class ThingToHash
{
    public bool? CouldBeFalseOrNullOrNull { get; }
    public int IncludesZero { get; }
    public string CanBeEmptyOrNull { get; }
    private string Hidden { get; }

    public ThingToHash(bool? couldBeFalseOrNull, int includesZero, string canBeEmptyOrNull)
    {
        CouldBeFalseOrNullOrNull = couldBeFalseOrNull;
        IncludesZero = includesZero;
        CanBeEmptyOrNull = canBeEmptyOrNull;
    }
}

static class StringExtensions
{
    public static int GetAltHashCode(this string toHash)
    {
        return toHash?.GetHashCode() ?? 17;
    }
}

static class NullableBoolExtensions
{
    public static int GetAltHashCode(this bool? toHash)
    {
        return toHash?.GetAltHashCode() ?? true.GetHashCode() * 19;
    }
}

static class BoolExtensions
{
    public static int GetAltHashCode(this bool toHash)
    {
        if (false == toHash)
        {
            return true.GetHashCode() * 17;
        }

        return toHash.GetHashCode();
    }
}

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine(false.GetHashCode());
        Console.WriteLine(((bool?)null).GetHashCode());
        Console.WriteLine(false == (bool?)null);

        Console.WriteLine(HashUnknownObject(new ThingToHash(null, 0, "")));
        Console.WriteLine(HashUnknownObject(new ThingToHash(false, 0, "")));

        Console.ReadKey();
    }

    static int HashUnknownObject(Object toHash)
    {
        PropertyInfo[] members = toHash.GetType().GetProperties(BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.Public);
        int hash = 17;

        foreach (PropertyInfo memberToHash in members)
        {
            object memberVal = memberToHash.GetValue(toHash);

            if (null == memberVal)
            {
                if (typeof(bool?) == memberToHash.PropertyType)
                {
                    hash += 31 * ((bool?)null).GetAltHashCode();
                }
                else if (typeof(string) == memberToHash.PropertyType)
                {
                    hash += 31 * ((string)null).GetAltHashCode();
                }
            }
            else
            {
                hash += 31 * memberToHash.GetValue(toHash).GetHashCode();
            }
        }

        return hash;
    }
}

You'd obviously have to add other checks to use the bool extension, add other extensions and so on to cover the cases you need. And do testing to check the impact of using reflection to serialise. You could reduce that for classes that already implement GetHashCode, for instance, by not generating hash codes per member for those.

And this code can obviously be cleaned up. It's just quick and dirty here.

Craig
  • 119
  • 7