
I'm implementing a caching table to avoid a costly operation that creates a generic object from a set of parameters describing it. When an object is requested, a hash of these parameters is computed, and a Dictionary containing the already created objects is queried to check whether a copy already exists; if so, it's returned without being created again.

My problem is that, since the parameters describing these objects can be numerous, collisions in the hashing function are unavoidable (and too frequent). On the other hand, retrieving these objects is a performance-critical operation, and I cannot afford full comparisons against all existing descriptions to find an already created object. I've tried many different hashing functions, but since the nature of the parameters is unknown, the results are unreliable.

What solutions other than hashing exist for this caching problem, or can hashing be used differently to avoid conflicts? A C# description of the problem:

class ObjectDescriptor
{
    // description made of a list of parameters of unknown type
    public object[] Fields;

    // hashing procedure that may have conflicts
    public override int GetHashCode()
    {
        int hash = 1009;
        for (int i = 0; i < Fields.Length; i++)
        {
            unchecked { hash = hash * 9176 + Fields[i].GetHashCode(); }
        }
        return hash;
    }
}

abstract class ObjectCache<T>
{
    private Dictionary<int, T> indexedObjects = new Dictionary<int, T>();

    // this operation is called many times and must be fast
    public T Get(ObjectDescriptor descr)
    {
        int hash = descr.GetHashCode(); // computed once; called many times, must be fast
        T cachedValue;
        if (!indexedObjects.TryGetValue(hash, out cachedValue))
        {
            cachedValue = CreateObject(descr);
            indexedObjects[hash] = cachedValue;
        }
        return cachedValue;
    }

    // costly operation
    protected abstract T CreateObject(ObjectDescriptor desc);
}
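
For reference, here is a sketch of what keying the dictionary on the descriptor itself would look like: the dictionary then falls back to `Equals` only for keys whose hashes collide, not for every stored description. The `Equals` override and the `Cache` class below are illustrative, not part of my actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Descriptor
{
    public object[] Fields;

    public override int GetHashCode()
    {
        int hash = 1009;
        foreach (var f in Fields)
            unchecked { hash = hash * 9176 + f.GetHashCode(); }
        return hash;
    }

    // Field-by-field comparison runs only for keys that land in the
    // same bucket, i.e. only on a hash collision.
    public override bool Equals(object obj) =>
        obj is Descriptor other && Fields.SequenceEqual(other.Fields);
}

class Cache<T>
{
    private readonly Dictionary<Descriptor, T> cache = new Dictionary<Descriptor, T>();

    public T GetOrAdd(Descriptor key, Func<Descriptor, T> create)
    {
        T value;
        if (!cache.TryGetValue(key, out value))
        {
            value = create(key);
            cache[key] = value;
        }
        return value;
    }
}
```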
Michele M.
  • A `Dictionary` solves all these problems and can handle collisions. It is based on a hash table and performs lookups with `O(1)` time complexity. – Olivier Jacot-Descombes Mar 13 '22 at 13:47
  • 1
    IMHO you should use `Dictionary`. Have you compared the cost of hash collisions with `HashCode.Combine` ? https://learn.microsoft.com/en-us/dotnet/api/system.hashcode?view=net-6.0 – Jeremy Lakeman Mar 13 '22 at 13:50
  • If you think that collisions happen too often, you could group your data by key and create a `Dictionary<TKey, List<TValue>>` instead. And btw, if you do not have a key per se, use a `HashSet<T>` instead to store unique values. – Olivier Jacot-Descombes Mar 13 '22 at 13:55
  • 2
    @OlivierJacot-Descombes That doesn't fix the fundamental problem of hashing collisions and the fact that full comparisons take too long. It sounds like a better hashing function is needed – Charlieface Mar 13 '22 at 13:58
  • Does this answer your question? [Quick and Simple Hash Code Combinations](https://stackoverflow.com/questions/1646807/quick-and-simple-hash-code-combinations) – Charlieface Mar 13 '22 at 14:01
  • If you have a lot of equal values, then collisions are not avoidable and they do not depend on the way hash codes are created or combined. The `GetHashCode` method presented above is okay. – Olivier Jacot-Descombes Mar 13 '22 at 14:06
  • For older versions of .NET, you can copy the source code from https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Private.CoreLib/src/System/HashCode.cs#L96. You could also use the algorithm from [`System.Tuple`](https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Private.CoreLib/src/System/Tuple.cs#L67). There will always be a limit beyond which it will be impossible to avoid collisions (you only have 32 bits) but a good implementation should go a long way. – Charlieface Mar 13 '22 at 14:10
  • 1
    Don't forget if your objects are immutable to cache the hashcodes – Charlieface Mar 13 '22 at 14:11
  • Another idea that you can try is to use a b-tree instead of hash map. – freakish Mar 13 '22 at 15:28
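
A sketch of the `System.HashCode` approach suggested in the comments (available since .NET Core 2.1, or via a compatibility package on older targets; the helper name `HashFields` is mine):

```csharp
using System;

static class DescriptorHashing
{
    // Accumulates each field's hash through System.HashCode, which applies
    // a well-dispersed mixing function between additions.
    public static int HashFields(object[] fields)
    {
        var hc = new HashCode();
        foreach (var f in fields)
            hc.Add(f);
        return hc.ToHashCode();
    }
}
```

Note that `HashCode` is seeded per process, so the resulting values are stable within a run but not across runs, which is fine for an in-memory cache.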

1 Answer


I'll leave the solution I ended up using. It is based on the fact that conflicts can be avoided by storing the whole values of multiple fields in a single hash where possible:

byte b1 = 42, b2 = 255;
int naiveHash = CombineHash(b1.GetHashCode(), b2.GetHashCode()); // can always have conflicts
int typeAwareHash = (b1 << 8) + b2; // no conflicts: each byte keeps its own 8 bits

To know how many bits a field requires, I introduced the IObjectDescriptorField interface:

interface IObjectDescriptorField
{
    int GetHashCodeBitCount();
}
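
For example, implementations might look like this (the field types below are illustrative, not from my actual code): a bool needs 1 bit, and an enum with N values needs ceil(log2(N)) bits.

```csharp
interface IObjectDescriptorField
{
    int GetHashCodeBitCount();
}

// Hypothetical field wrapping a bool: one bit is enough.
class BoolField : IObjectDescriptorField
{
    public bool Value;
    public int GetHashCodeBitCount() => 1;
    public override int GetHashCode() => Value ? 1 : 0;
}

// Hypothetical field wrapping an enum-like value with at most 4 states:
// two bits cover it exactly.
class SmallEnumField : IObjectDescriptorField
{
    public int Value;
    public int GetHashCodeBitCount() => 2;
    public override int GetHashCode() => Value & 0x3;
}
```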

I then updated the ObjectDescriptor class to use a HashCodeBuilder class:

class ObjectDescriptor
{
    public IObjectDescriptorField[] Fields;

    public override int GetHashCode()
    {
        HashCodeBuilder hash = new HashCodeBuilder();
        for (int i = 0; i < Fields.Length; i++)
        {
            hash.AddBits(Fields[i].GetHashCode(), Fields[i].GetHashCodeBitCount());
        }
        return hash.GetHashCode();
    }
}

HashCodeBuilder stacks bits until all 32 are used, then falls back to a simple hash combination function like before:

public class HashCodeBuilder
{
    private const int HASH_SEED = 352654597;

    private static int Combine(int hash1, int hash2)
    {
        return ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hash2;
    }

    private int hashAccumulator;
    private int bitAccumulator;
    private int bitsLeft;

    public HashCodeBuilder()
    {
        hashAccumulator = HASH_SEED;
        bitAccumulator = 0;
        bitsLeft = 32;
    }

    public void AddBits(int bits, int bitCount)
    {
        if (bitsLeft < bitCount)
        {
            // fold the full 32-bit chunk into the hash and start a new chunk
            hashAccumulator = Combine(hashAccumulator, bitAccumulator);
            bitAccumulator = 0;
            bitsLeft = 32;
        }

        // note the parentheses: in C#, + binds tighter than <<
        bitAccumulator = (bitAccumulator << bitCount) | bits;
        bitsLeft -= bitCount;
    }

    public override int GetHashCode()
    {
        return Combine(hashAccumulator, bitAccumulator);
    }
}

This solution, of course, can still have conflicts if more than 32 bits are used, but it worked for me because many of the fields were just bools or enums with few values, which benefit greatly from being combined like this.
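
To make the packing concrete, here is the same idea as a standalone sketch without the builder class (the field widths are illustrative):

```csharp
static class BitPacking
{
    // Pack a bool (1 bit), a 2-bit enum value, and a byte (8 bits)
    // into 11 bits of a single int. As long as the total width stays
    // within 32 bits, distinct inputs yield distinct hashes.
    public static int Pack(bool flag, int enumVal, byte b)
    {
        int h = flag ? 1 : 0;
        h = (h << 2) | (enumVal & 0x3);
        h = (h << 8) | b;
        return h;
    }
}
```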
