6

By "increasingly" what I mean is that Add is fast at the beginning when there is a low number of keys. After inserting 20% of the keys, it gets very slow. After 50% it gets unbearably slow.

I get that the lower the number of keys, the faster the "key collision search" when adding new elements to the dictionary. But is there any possible way to skip this downside while keeping the Dictionary? I know beforehand that keys don't collide so no check is needed, but I don't know if there is any way to successfully use this info in the code.

BTW I am forced to use the dictionary structure because of architecture restrictions (this structure is swallowed later by a db exporter).


What my code does:

var keyList = GetKeyList();
var resultDict = new Dictionary<T,T>();
foreach (var key in keyList)
{
    resultDict.Add(key,someResult);
}

Edit: since people is asking how the hash code is generated, I will try to clarify this.

Theoretically I have no control over the hash code generation, because unfortunately it uses a convention between multiple systems that are connected through the same db.

In practice, the piece of code that generates the hash code is indeed my code (disclaimer: it wasn't me choosing the convention that is used in the generation).

The key generation is way more complicated than that, but it all boils down to this:

private List<ResultKey> GetKeyList(string prefix, List<float> xCoordList, List<float> yCoordList)
{
    var keyList = new List<ResultKey>();
    var constantSensorName = "xxx";
    foreach (float xCoord in xCoordList)
    {
        foreach (float yCoord in yCoordList)
        {
            string stationName = string.Format("{0}_E{1}N{2}", prefix, xCoord, yCoord);
            keyList.Add(new ResultKey(constantSensorName, stationName));
        }
    }
    return keyList;
}

public struct ResultKey
{
    public string SensorName { get; set; }
    public string StationName { get; set; }

    public ResultKey(string sensorName, string stationName)
    {
        this.SensorName = sensorName;
        this.StationName = stationName;
    }
}
Xavier Peña
  • 7,399
  • 9
  • 57
  • 99
  • 2
    What are your keys? Probably bad hash code generator. – usr Aug 04 '15 at 14:11
  • 1
    "I know beforehand that keys don't collide so no check is needed, but I don't know if there is any way to successfully use this info in the code." How do you know this? – Jerry Federspiel Aug 04 '15 at 14:11
  • 3
    Have you tried specifying a large initial capacity to avoid rehashing? – Jerry Federspiel Aug 04 '15 at 14:13
  • @usr Each key is a `struct` that contains a pair of strings. Those strings are later necessary for the db exporter to find where the data must be inserted. In this particular case, the strings of each key are quite long and quite similar between them (hence the slow collision search). – Xavier Peña Aug 04 '15 at 14:20
  • @XavierPeña The fact that the items are unique doesn't mean that you're not getting a lot of collisions. You're probably getting a lot of collisions. And having large similar strings also means an expensive hash code generation function, and an expensive equality check when there are collisions. – Servy Aug 04 '15 at 14:23
  • @XavierPeña post the code that generate the hash code. Probably bad. – usr Aug 04 '15 at 14:25
  • Do you have any control over the db exporter? If you really just need an IEnumerable> you can just use a List and get away with not making any equality or hash comparisons at all. – Jerry Federspiel Aug 04 '15 at 14:28
  • Also, when you say "the strings are quite similar", is it the case that they often share a common prefix but are mostly different at the end of the string, or are rare differences somewhat randomly scattered throughout the strings? This would affect efficient strategies for comparison. – Jerry Federspiel Aug 04 '15 at 14:30
  • @JerryFederspiel I just tried using `capacity` and the insertion speed seems the same as before. About how I know they don't collide: in this case, one of the strings in the key is constant and the other string is unique name (and this condition is checked beforehand). – Xavier Peña Aug 04 '15 at 14:35
  • Can you please post your definition for GetHashCode? – Jerry Federspiel Aug 04 '15 at 14:39
  • @XavierPeña That in no way means that you don't have any collisions. There are an infinite number of possible strings, but only ~4 billion possible hash code values. Not all strings can have a unique hash code. That you have all unique strings doesn't mean you have all unique hash codes. – Servy Aug 04 '15 at 14:58
  • @usr Took me a while but here it is (see Edit in the post). – Xavier Peña Aug 04 '15 at 14:59
  • @JerryFederspiel Updated, see Edit in the post. – Xavier Peña Aug 04 '15 at 14:59
  • 2
    @Servy he did, because it's not there :) See my answer. – usr Aug 04 '15 at 15:00
  • You are showing us how you are generating ResultKey, but not an implementation of ResultKey.GetHashCode or an `IEqualityComparer`. @usr 's got this. – Jerry Federspiel Aug 04 '15 at 15:08
  • @usr That's right :( Thank you for teaching me about that. Could not even understand your first comment before, now with all the discussion and your answer I am beginning to grasp the concept and I see how the collision check has many workarounds. – Xavier Peña Aug 04 '15 at 15:27

4 Answers4

7

The first thing that comes to mind is to create your own hashing function. The Add method for the dictionary is going to call the default implementation of the getHashCode() method when it goes to add it to the structure. If you put a wrapper class around your keys and overwrote the getHashCode() method, then you could write your own hashing function which, presumably, could implement a less collision prone hash function.

Simo9000
  • 97
  • 4
  • 3
    Also, if OP doesn't control the classes to be added, he can specify an IEqualityComparer to the dictionary constructor. – Jerry Federspiel Aug 04 '15 at 14:10
  • 2
    I'm sure the idea about the hash code goes in the right direction but I find it a better approach to wait until the OP has posted more complete information. We have not seen the code that generates the hash code yet. – usr Aug 04 '15 at 14:29
3

You are using the default hash code generation for your struct ResultKey. The default hash code generation for structs is disappointingly bad. You can't rely on that here because your struct contains two strings which trigger a bad case (see the linked answer). Essentially, only your SensorName field makes it into the hash code, nothing else. That causes all keys with the same SensorName to collide.

Write your own function. I quickly generated one using Resharper:

public struct ResultKey : IEquatable<ResultKey>
{
    public string SensorName { get; set; }
    public string StationName { get; set; }

    public ResultKey(string sensorName, string stationName)
    {
        this.SensorName = sensorName;
        this.StationName = stationName;
    }

    public bool Equals(ResultKey other)
    {
        return string.Equals(SensorName, other.SensorName) && string.Equals(StationName, other.StationName);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        return obj is ResultKey && Equals((ResultKey)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            return ((SensorName != null ? SensorName.GetHashCode() : 0)*397) ^ (StationName != null ? StationName.GetHashCode() : 0);
        }
    }

    public static bool operator ==(ResultKey left, ResultKey right)
    {
        return left.Equals(right);
    }

    public static bool operator !=(ResultKey left, ResultKey right)
    {
        return !left.Equals(right);
    }
}
Community
  • 1
  • 1
usr
  • 168,620
  • 35
  • 240
  • 369
  • 1
    Note also the common prefix at the beginning of the stationname. It may be worth checking the stationname strings with a backwards for loop. – Jerry Federspiel Aug 04 '15 at 15:02
  • 1
    @JerryFederspiel I'm very sure that fixing the quadratic time complexity issue is enough to resolve the issue. It is rare that hash code computation is a performance problem. Good point in general, though. – usr Aug 04 '15 at 15:03
  • I am sorry if I am struggling to keep the pace, but I am going back and forth from vb.net (my code) to c# (this post, because the vb.net tag gets little attention). So I tested the solution and this is what happened: `Arithmetic operation resulted in an overflow, at System.Collections.Generic.GenericEqualityComparer'1.GetHashCode(T obj)`. This happened at the first `GetHashCode()` call. – Xavier Peña Aug 04 '15 at 15:19
  • 2
    You need to find a VB.NET solution to generating a good hash code. Search for "vb.net gethashcode implementation" to find something good. Not sure why VB does not have unsigned arithmetic. It does not seem to complicate the language much. – usr Aug 04 '15 at 15:23
  • And by "unsigned" I mean unchecked arithmetic. – usr Aug 04 '15 at 15:39
  • 1
    So the solution in VB.NET seems to be the following: calculate `hash` as `Long`, but then return it as `Integer` with `Return CInt(hash And &H7FFFFFFFL)`. This way the overflow is avoided. Now it works like a charm, thanks. Saved my day and learned a lot. – Xavier Peña Aug 04 '15 at 15:45
2

Your ResultKey contains two strings, so you need a hashcode that combine them.

"How do I calculate a good hash code for a list of strings?" contains some answer showing how to do this.

However you get do a lot worse then

public override int GetHashCode()
{   
   return (SensorName + StationName).GetHashCode();
}
Community
  • 1
  • 1
Ian Ringrose
  • 51,220
  • 55
  • 213
  • 317
-2

If you just want to fulfill API requirements and need a dirty solution, you could implement your own Dictionary.

public class FakeFastDictionary<TKey, TValue> : Dictionary<TKey, TValue>
{
    protected IList<KeyValuePair<TKey, TValue>> _list
        = new List<KeyValuePair<TKey, TValue>>();

    public new void Add(TKey key, TValue value)
    {
        _list.Add(new KeyValuePair<TKey, TValue>(key, value));
    }

    public new ICollection<TValue> Values
    {

        get
        {
            // there may be faster ways to to it:
            return _list.Select(x => x.Value).ToArray();
        }
    }

    public new ICollection<TKey> Keys
    {
        get
        {
            // there may be faster ways to to it:
            return _list.Select(x => x.Key).ToArray();
        }
    }
}

This is a running sample: https://dotnetfiddle.net/BDyks0

Hannes
  • 426
  • 3
  • 12
  • At first I thought this might work, and indeed suggested something very similar in the comments... but the problem is that we don't know if the consumer of this dictionary needs fast access to the values given a key. It would be appropriate to add disclaimers about what the tradeoffs of this solution are. – Jerry Federspiel Aug 04 '15 at 14:54