9

I have a big chunck of data containing ~1.5 million entries. Each entry is an instance of a class like this:

public class Element
{
    public Guid ID { get; set; }
    public string name { get; set; }
    public property p... p1... p2... 
}

I have a list of Guids (~4 millions) that I need to get the names based on a list of instances of the Element class.

I'm storing the Element objects in a Dictionary, but it takes ~90 seconds to populate the data. Is there any way to improve performance when adding items to the dictionary? The data doesn't have duplicates but I know that the dictionary checks for duplicates when adding a new item.

The structure doesn't need to be a dictionary if there's a better one. I tried to put the Element objects in a List which performed a lot better (~9sec) when adding. But then when I need to look for the item with a certain Guid it takes more than 10min to find all the 4million elements. I tried that using List.Find() and manually iterating through the list.

Also, if instead of using System.Guid I convert them all to String and store their string representation on the data structures the whole both operations of populating the dictionary and filling the names on the other list takes only 10s, but then my application consumes 1.2Gb of RAM, instead of 600mb when I store them as System.Guid.

Any ideas on how to perform it better?

rbasniak
  • 4,484
  • 11
  • 51
  • 100
  • 1
    Take a look [here](http://stackoverflow.com/questions/875222/performance-of-dictionaryof-string-somereferencetype-in-vb-net) for performance of dictionary. – Eminem Jul 22 '15 at 12:40
  • 8
    Sounds like you really should be storing this data in a database, rather than creating it in-memory in your app. – David Arno Jul 22 '15 at 12:41
  • 3
    You may consider the [hashtable](https://msdn.microsoft.com/en-us/library/system.collections.hashtable%28v=vs.110%29.aspx) which have quite good performances – Thomas Ayoub Jul 22 '15 at 12:43
  • @DavidArno Actually, I'm reading the data from a database, but those 4 millions of queries takes hours, because of this I decided to load the entire table on memory and then use it to do the search. Which is performing infinitely better, but I'd still like to improve it a little more. – rbasniak Jul 22 '15 at 12:44
  • 3
    Can you post some more code? How are you creating the dictionary? Does it help if you override `Equals` and `GetHashCode`? – Kobi Jul 22 '15 at 12:45
  • 4
    How have you defined `Equals` and `GetHashCode`? – D Stanley Jul 22 '15 at 12:45
  • What are you using as a key of the dictionary? The Element instances or the Guid? If the former, you'll need to implement Equals() and a _good_ hash in GetHashCode(). And preferably implement IEquatable too. – Willem van Rumpt Jul 22 '15 at 12:46
  • @RBasniak Maybe you could ask a question about improving the performance of your DB queries? – juharr Jul 22 '15 at 12:47
  • 1
    How many minutes will it take to find all 4 million items from a DB? 1 at a time. Definately more than 10 minutes. Best case of 1ms per check -> 66,66 minutes. I guess he wants a high performance lookup system some sort and want to speed up the init process. In memory lookups from dictionary is way faster than DB lookups with that amount of lookups. – Wolf5 Jul 22 '15 at 12:47
  • 2
    @Wolf5, where does this "best case" 1 ms come from? – dymanoid Jul 22 '15 at 12:49
  • @DStanley I'm not, the dictionary Key is the ID property of the Element class, and not the Element class itself. This property is of type System.Guid. I also tried to use the class as the Key, and then implement Equals and GetHashCode (the hash as a checksum of the bytes of the Gui) but it performed worse than now. – rbasniak Jul 22 '15 at 12:50
  • @dynamoid, Network experience. TCP packets back and forth. – Wolf5 Jul 22 '15 at 12:50
  • @juharr I did before heading to this approach http://stackoverflow.com/questions/30246185/how-to-speed-up-reading-large-amounts-of-data-in-sql-server-and-c-sharp – rbasniak Jul 22 '15 at 12:51
  • 4
    Are you creating your dictionary with a big enough initial capacity? – sstan Jul 22 '15 at 12:53
  • 3
    I doubt your numbers... On my laptop, adding 1.5m of `Guid` to a `Dictionary` happens in less than a second, and searching 4m of `Guid` is 1.3s – xanatos Jul 22 '15 at 13:25
  • I confirmed @xanatos's test. Even without specifying capacity it is still about 1s-2s. – tia Jul 23 '15 at 04:57
  • Like the others, my test implementation was quick. Possibly @xanatos's answer has a clue to the problem? – Niall Connaughton Jul 23 '15 at 07:38

3 Answers3

9

Your problem is perhaps connected to "sequential" Guid, like:

c482fbe1-9f16-4ae9-a05c-383478ec9d11
c482fbe1-9f16-4ae9-a05c-383478ec9d12
c482fbe1-9f16-4ae9-a05c-383478ec9d13
c482fbe1-9f16-4ae9-a05c-383478ec9d14
c482fbe1-9f16-4ae9-a05c-383478ec9d15

The Dictionary<,> has a problem with those, because they often have the same GetHashCode(), so it has to do some tricks that change the search time from O(1) to O(n)... You can solve it by using a custom equality comparer that calculates the hash in a different way, like:

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        var bytes = obj.ToByteArray();

        uint hash1 = (uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24);
        uint hash2 = (uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24);
        uint hash3 = (uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24);
        uint hash4 = (uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24);

        int hash = 37;

        unchecked
        {
            hash = hash * 23 + (int)hash1;
            hash = hash * 23 + (int)hash2;
            hash = hash * 23 + (int)hash3;
            hash = hash * 23 + (int)hash4;
        }

        return hash;
    }

    #endregion
}

Then you simply declare the dictionary like this:

var dict = new Dictionary<Guid, Element>(ReverseGuidEqualityComparer.Default);

a little test to see the difference:

private static void Increment(byte[] x)
{
    for (int i = x.Length - 1; i >= 0; i--)
    {
        if (x[i] != 0xFF)
        {
            x[i]++;
            return;
        }

        x[i] = 0;
    }
}

and

// You can try timing this program with the default GetHashCode:
//var dict = new Dictionary<Guid, object>();
var dict = new Dictionary<Guid, object>(ReverseGuidEqualityComparer.Default);
var hs1 = new HashSet<int>();
var hs2 = new HashSet<int>();

{
    var guid = Guid.NewGuid();

    Stopwatch sw = Stopwatch.StartNew();

    for (int i = 0; i < 1500000; i++)
    {
        hs1.Add(ReverseGuidEqualityComparer.Default.GetHashCode(guid));
        hs2.Add(guid.GetHashCode());
        dict.Add(guid, new object());
        var bytes = guid.ToByteArray();
        Increment(bytes);
        guid = new Guid(bytes);
    }

    sw.Stop();

    Console.WriteLine("Milliseconds: {0}", sw.ElapsedMilliseconds);
}

Console.WriteLine("ReverseGuidEqualityComparer distinct hashes: {0}", hs1.Count);
Console.WriteLine("Guid.GetHashCode() distinct hashes: {0}", hs2.Count);

With sequential Guid the difference in the number of distinct hash codes is staggering:

ReverseGuidEqualityComparer distinct hashes: 1500000
Guid.GetHashCode() distinct hashes: 256

Now... If you don't want to use ToByteArray() (because it allocates useless memory), there is a solution that uses reflection and expression trees... It should work correctly with Mono, because Mono "aligned" its implementation of Guid to the one of Microsoft in 2004, that is ancient time :-)

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    public static readonly Func<Guid, int> GetHashCodeFunc;

    static ReverseGuidEqualityComparer()
    {
        var par = Expression.Parameter(typeof(Guid));
        var hash = Expression.Variable(typeof(int));

        var const23 = Expression.Constant(23);

        var const8 = Expression.Constant(8);
        var const16 = Expression.Constant(16);
        var const24 = Expression.Constant(24);

        var b = Expression.Convert(Expression.Convert(Expression.Field(par, "_b"), typeof(ushort)), typeof(uint));
        var c = Expression.Convert(Expression.Convert(Expression.Field(par, "_c"), typeof(ushort)), typeof(uint));
        var d = Expression.Convert(Expression.Field(par, "_d"), typeof(uint));
        var e = Expression.Convert(Expression.Field(par, "_e"), typeof(uint));
        var f = Expression.Convert(Expression.Field(par, "_f"), typeof(uint));
        var g = Expression.Convert(Expression.Field(par, "_g"), typeof(uint));
        var h = Expression.Convert(Expression.Field(par, "_h"), typeof(uint));
        var i = Expression.Convert(Expression.Field(par, "_i"), typeof(uint));
        var j = Expression.Convert(Expression.Field(par, "_j"), typeof(uint));
        var k = Expression.Convert(Expression.Field(par, "_k"), typeof(uint));

        var sc = Expression.LeftShift(c, const16);
        var se = Expression.LeftShift(e, const8);
        var sf = Expression.LeftShift(f, const16);
        var sg = Expression.LeftShift(g, const24);
        var si = Expression.LeftShift(i, const8);
        var sj = Expression.LeftShift(j, const16);
        var sk = Expression.LeftShift(k, const24);

        var body = Expression.Block(new[]
        {
            hash
        },
        new Expression[]
        {
            Expression.Assign(hash, Expression.Constant(37)),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Field(par, "_a")),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(b, sc), typeof(int))),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(d, Expression.Or(se, Expression.Or(sf, sg))), typeof(int))),
            Expression.MultiplyAssign(hash, const23),
            Expression.AddAssign(hash, Expression.Convert(Expression.Or(h, Expression.Or(si, Expression.Or(sj, sk))), typeof(int))),
            hash
        });

        GetHashCodeFunc = Expression.Lambda<Func<Guid, int>>(body, par).Compile();
    }

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        return GetHashCodeFunc(obj);
    }

    #endregion

    // For comparison purpose, not used
    public int GetHashCodeSimple(Guid obj)
    {
        var bytes = obj.ToByteArray();

        unchecked
        {
            int hash = 37;

            hash = hash * 23 + (int)((uint)bytes[0] | ((uint)bytes[1] << 8) | ((uint)bytes[2] << 16) | ((uint)bytes[3] << 24));
            hash = hash * 23 + (int)((uint)bytes[4] | ((uint)bytes[5] << 8) | ((uint)bytes[6] << 16) | ((uint)bytes[7] << 24));
            hash = hash * 23 + (int)((uint)bytes[8] | ((uint)bytes[9] << 8) | ((uint)bytes[10] << 16) | ((uint)bytes[11] << 24));
            hash = hash * 23 + (int)((uint)bytes[12] | ((uint)bytes[13] << 8) | ((uint)bytes[14] << 16) | ((uint)bytes[15] << 24));

            return hash;
        }
    }
}

Other solution, based on "undocumented but working" programming (tested on .NET and Mono):

public class ReverseGuidEqualityComparer : IEqualityComparer<Guid>
{
    public static readonly ReverseGuidEqualityComparer Default = new ReverseGuidEqualityComparer();

    #region IEqualityComparer<Guid> Members

    public bool Equals(Guid x, Guid y)
    {
        return x.Equals(y);
    }

    public int GetHashCode(Guid obj)
    {
        GuidToInt32 gtoi = new GuidToInt32 { Guid = obj };

        unchecked
        {
            int hash = 37;

            hash = hash * 23 + gtoi.Int32A;
            hash = hash * 23 + gtoi.Int32B;
            hash = hash * 23 + gtoi.Int32C;
            hash = hash * 23 + gtoi.Int32D;

            return hash;
        }
    }

    #endregion

    [StructLayout(LayoutKind.Explicit)]
    private struct GuidToInt32
    {
        [FieldOffset(0)]
        public Guid Guid;

        [FieldOffset(0)]
        public int Int32A;
        [FieldOffset(4)]
        public int Int32B;
        [FieldOffset(8)]
        public int Int32C;
        [FieldOffset(12)]
        public int Int32D;
    }
}

It uses the StructLayout "trick" to superimpose a Guid to a bunch of int, write to the Guid and read the int.

Why does Guid.GetHashCode() has problems with sequential ids?

Very easy to explain: from the reference source, the GetHashCode() is:

return _a ^ (((int)_b << 16) | (int)(ushort)_c) ^ (((int)_f << 24) | _k);

so the _d, _e, _g, _h, _i, _j bytes aren't part of the hash code. When incremented a Guid is first incremented in the _k field (256 values), then on overflow in the _j field (256 * 256 values, so 65536 values), then on the _i field (16777216 values). Clearly by not hashing the _h, _i, _j fields the hash of a sequential Guid will only show 256 different values for non-huge range of Guid (or at maximum 512 different values, if the _f field is incremented once, like if you start with a Guid similar to 12345678-1234-1234-1234-aaffffffff00, where aa (that is "our" _f) will be incremented to ab after 256 increments of the Guid)

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • Does a dictionary have a problem with sequential integers because they have similar hash codes? – Rawling Jul 23 '15 at 07:13
  • @Rawling I was writing something wrong... the problem of `Guid.GetHashCode()` is that for sequential ids the number of distinct hashes generated is very small (like 256 for 1.5m of sequential ids) – xanatos Jul 23 '15 at 07:31
  • Ah, that makes more sense. What a terrible hashcode. – Rawling Jul 23 '15 at 07:35
  • Interesting answer. Just one point - the .ToByteArray() in GetHashCode() is expensive, especially if you're putting 1.5m items in a Dictionary. Allocating an array every time the Dictionary wants to compare the hash code of the item will hammer the GC with short lived objects. If there's no other way to do this, at least cache the hash code of the object so it doesn't have to be recalculated (assuming it's immutable). – Niall Connaughton Jul 23 '15 at 07:36
  • 2
    @Rawling It is built for random hashes, not for sequential hashes. – xanatos Jul 23 '15 at 07:36
  • @NiallConnaughton The `Dictionary<,>` "saves" the result of the `GetHashCode()` of added objects, so when you search for an item, a single call to `GetHashCode()` is ever done: the one for the searched item. Without reflection you can't directly access the internal bytes of a `Guid`. – xanatos Jul 23 '15 at 07:37
  • Fair enough - you're still guaranteeing at least 1.5 million byte arrays to be created here. It's still a general performance red flag for any heap allocations to be done in GetHashCode, especially if you know the number of items will be high. While the Guid GetHashCode is vulnerable to this issue with sequential guids, if you look at its implementation, there is no allocation. – Niall Connaughton Jul 23 '15 at 07:40
  • @NiallConnaughton Yep... If you want, I can make a reflection-based `Equality` comparer... but it would be even worse :-) reflecting on private fields is bad. – xanatos Jul 23 '15 at 07:42
  • Indeed, so perhaps a different solution altogether is warranted... @RBasniak - if the guids are being allocated in a DB, is it possible to use a sequential integer id instead? I expect things will run much more smoothly then. – Niall Connaughton Jul 23 '15 at 07:44
  • @NiallConnaughton Mmmmh... Mono uses the same private fields from this [commit](https://github.com/mono/mono/commit/8b96bfb6345902b3dcd9a350d6c0d190ea8d8835), that is 2004-05-31... Reflection isn't so much bad then... – xanatos Jul 23 '15 at 07:46
  • 1
    @NiallConnaughton Added an Expression-tree (so a Reflection based) solution, for the daredevils. – xanatos Jul 23 '15 at 08:28
  • @NiallConnaughton Would be perfect, but I have no control on the database. It's from a third party application. – rbasniak Jul 23 '15 at 13:10
  • @xanatos Thank you for actually showing the source of the problem, instead of an alternate solution. Not that it's bad or something. All suggestions that I implemented made me learn something. The database has many sequential GUIDs because they represent the type of the object, and only the last bytes are always changing. In the end I implemented a class like Blindy suggested, but with your hashing code. Now it takes less than 10sec, which is perfectly acceptable for my needs. – rbasniak Jul 23 '15 at 13:13
  • 1
    You shouldn't need an extra Guid wrapper type, you should be fine using the standard guid and passing one of the EqualityComparers above to the Dictionary. @xanatos - interesting approach with the StructLayout, etc... Can only upvote your answer once, sorry, but have learned a few things :P – Niall Connaughton Jul 24 '15 at 02:30
4

I'm not, the dictionary Key is the ID property of the Element class, and not the Element class itself. This property is of type System.Guid.

The problem with Guid is that it's a very specialized construct. For one thing it's a struct, not a class. Moving this thing around isn't as simple as moving a pointer (technically a handle, but same thing), it involves moving the entire memory block around. Keep in mind the .NET memory model makes everything be compact, so that also involves moving other blocks around to make room.

Also looking at the source code, it stores all the parts as separate fields, 11 of them! That's a lot of comparisons for 1.5 million entries.

What I'd do is create a sort of alternate Guid implementation (class, not struct!) tailor made for efficient comparisons. All the fancy parsing isn't needed, just focus on speed. Guids are 16 bytes in length, that means 4 long fields. You need to implement Equals as usual (compare the 4 fields) and GetHashCode as something like XORing the fields. I'm certain that's good enough.

Edit: Note that I'm not saying the framework-provided implementation is bad, it's just not made for what you're trying to do with it. In fact it's terrible for your purpose.

Blindy
  • 65,249
  • 10
  • 91
  • 131
  • Your solution has a lot of good points. I'll give a try and see how the performance goes. Could you please give me some more insight on the XORing part that you suggested? I tried to create my own class before using the GUID, but it performed very slowly because of my hashcode, I was just suming the individual bytes. – rbasniak Jul 22 '15 at 23:10
  • 1
    Summing 11 values won't do much for you over comparing 11 values (tia's solution). What I mean is using only 4 word-sized values instead (64 bits, ie `long`). It's the same data but represented in a way modern CPUs can easily fetch it. As for the hashcode, simply return `val1^val2^val3^val4`. Xor is preferred over summing because it doesn't "lose" bits, making for a better hashing tool (another popular one is multiplying by a prime and adding the next -- that's what `Math.Random` does). – Blindy Jul 23 '15 at 02:44
  • GUID is two `long` in size, not four. You can use `System.Numerics.Vector` if you want a fast comparison. – GeirGrusom Jul 23 '15 at 08:24
2

If your data is pre-sorted, You can use List<T>.BinarySearch to quickly search in the list. You will need to create a comparer class, and use it for looking up.

class ElementComparer : IComparer<Element>
{
    public int Compare(Element x, Element y)
    {
        // assume x and y is not null
        return x.ID.CompareTo(y.ID);
    }
}

Then use it

var comparer = new ElementComparer();
var elements = new List<Element>(1500000); // specify capacity might help a bit

//... (load your list here. Sort it with elements.Sort(comparer) if needed)

Guid guid = elements[20]; // for the sake of testing
int idx = elements.BinarySearch(new Element { ID = guid }, comparer);

You can wrap this whole thing in IReadOnlyDictionary<Guid, Element> if you want, but maybe you don't need it this case.

tia
  • 9,518
  • 1
  • 30
  • 44
  • 1
    This will be a lot slower than what he has. You're completely ignoring the gains from hashing values as opposed to comparing 11 values over and over and over again. – Blindy Jul 22 '15 at 15:38
  • I think the above solution is fast enough for the job. According to how `Guid.CompareTo` work described in the documentation, most GUID pair will end the comparison after first `int` unmatched. This might be slower than hash seek, but not 'a lot' slower. – tia Jul 22 '15 at 15:58