c# HashSet init takes too long

Question

I'm dealing with the fact that I need to init a HashSet with a set of elements but without any kind of comparation class.

After the init, any element added to the HashSet need to be passed with a comparator.

How can I accomplish it?

Now I have this:

HashSet<Keyword> set = new HashSet<Keyword>(new KeyWordComparer());

The problem is that the init takes to long and there's no necessity in applying the comparation.

KeywordComparer Class:

class KeyWordComparer : EqualityComparer<Keyword>
{
    public override bool Equals(Keyword k1, Keyword k2)
    {
        int equals = 0;
        int i = 0;
        int j = 0;

        // based on sorted ids
        while (i < k1._lista_modelos.Count && j < k2._lista_modelos.Count)
        {
            if (k1._lista_modelos[i] < k2._lista_modelos[j])
            {
                i++;
            }
            else if (k1._lista_modelos[i] > k2._lista_modelos[j])
            {
                j++;
            }
            else
            {
                equals++;
                i++;
                j++;
            }
         }

         return equals >= 8;
    }
    public override int GetHashCode(Keyword keyword)
    {
        return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.
    }
}

Note: This is a follow-up question to c# comparing list of IDs.

"After the init, any element added to the HashSet need to be passed with a comparator." That's not how HashSet works. What would you expect to happen if a different comparer were used for every element? It's really not clear what you're trying to achieve here, or exactly what's taking too long - the line you've provided won't take any (significant) time at all. If you're *actually* adding a lot of items to the set after constructing it, then yes, that can take time - but we can't see why without a [mcve]. — Jon Skeet, Nov 19 '17 at 14:07
@JonSkeet Basically, May I use HashSet set = new HashSet(); and after the init, apply the KeywordComparer class? — Apalabrados, Nov 19 '17 at 14:13
No. A `HashSet` needs a comparer from the start - either the default one for the type, or a user-specified one. We still don't know what's taking so long though. The `HashSet` constructor is cheap - does your `KeywordComparer` constructor take a long time? (Again, if you'd just provide a [mcve] we'd be in a much better position to help you.) — Jon Skeet, Nov 19 '17 at 14:14
What are you calling “init”. The creation of the new instance or actually loading the data? — InBetween, Nov 19 '17 at 14:15
@JonSkeet If I use new keywordComparer() and init the hash with 90.000 elements, it last more than hour and in case I leave it empty, it takes only 10 seconds. That's my problem. — Apalabrados, Nov 19 '17 at 14:17
Well that sounds like your comparer is inefficient then. But it's entirely necessary. But *yet again*, we can't help you make it more efficient without seeing the code that's being slow. (It also doesn't help that your question shows the code that *isn't* slow, which could easily be considered as the "init" part, even though that's not what you mean.) — Jon Skeet, Nov 19 '17 at 14:17
You should post your `KeyWordComparer` class and perhaps we can help you. — Magnus, Nov 19 '17 at 14:18
And now at least *one* problem is blindingly obvious - it's already in a comment: `return 0;//notice that using the same hash for all keywords gives you an O(n^2) time complexity though.` Why are you returning such a terrible hash code? (And as rene says, your `Equals` method could be more efficient - and may well be broken, although it's hard to tell without knowing what you're trying to achieve.) — Jon Skeet, Nov 19 '17 at 14:22
I strongly suspect that whatever equality relation you're attempting to implement violates the requirements of equality that HashSet and many other .NET collections rely on. — Jon Skeet, Nov 19 '17 at 14:23
That’s 90.000 collisions right there hammering `Equals`, which seems improvable,, more than what would be advisable. I suspect your class is not a very good candidate for hash sets — InBetween, Nov 19 '17 at 14:24
@rene Yes, In case the new keyword has more than 8 elements equals to any of the HastSet it's not added. — Apalabrados, Nov 19 '17 at 14:27
To expand on what Jon has said the whole point of a hashcode is so that the comparer doesn't have to run the equality code every time it wants to compare two things. You implement a hashcode and the comparer checks if they are equal and only if they are does it run the equality code. With your hashcode implementation it has to run the relatively heavy looking equals every time it wants to see if two things are equal. — Chris, Nov 19 '17 at 14:27
So in your case with 90,000 elements the first element added just goes in. The second needs one comparison to see if it is the same as first. The third element needs to run equals against two elements. Your 90,000th element will be running equality against the existing 89,999 elements... So my back of the envelope calculations suggest you are running your equals method about four billion times.... — Chris, Nov 19 '17 at 14:30
"Yes, In case the new keyword has more than 8 elements equals to any of the HastSet it's not added." That sounds like it's a violation of equality. Consider keywords A, B and C. Suppose B has 16 elements - 8 from A and 8 from C. Then A.Equals(B) will be true, and B.Equals(C) will be true, but A.Equals(C) will be false. That's a violation of the transitive property of equality. — Jon Skeet, Nov 19 '17 at 14:32
@Chris Exactly, that's what I want to avoid when I first init the hastset. Once it's initied the hashset (loaded from database) the next items need to checked with the comparer — Apalabrados, Nov 19 '17 at 14:33
@JonSkeet the 90.000 elements does not violates anything due to they had been checked before and saved in a database. Only new ones must be checked. — Apalabrados, Nov 19 '17 at 14:35
Given that you've got a broken equality relation to start with, I don't think I'd want to take that as a given. If you can come up with a working equality relation, you can probably come up with a decent hash code as well - at which point it will be efficient. I would worry about the correctness before efficiency. (Aside from anything else, with your current hash code there is *no point whatsoever* in using a `HashSet`. You might as well just use a `List`. The whole point of a hash-based collection is that it can efficiently determine distinct elements via the hash code.) — Jon Skeet, Nov 19 '17 at 14:37
Just describe what problem you are really solving, because at this point it's clear that method you've chosen for that is not the best one. — Evk, Nov 19 '17 at 14:39
@Evk well, every keyword has 20 IDs, so when I want to add a new Keyword to the HashSet, the KeywordComparer check that the new one does not have more than 8 ID's repeated with any keyword of the HashSet.In such case, new keyword is not included, if not, it's included. — Apalabrados, Nov 19 '17 at 14:44
Consider the situation I described before with A, B and C. If you add B first, you couldn't add A or C (you'd just end up with B). If you add them in the order A, B, C you'd end up with A and C. Is that really what you want? It's pretty odd. — Jon Skeet, Nov 19 '17 at 14:53
To the other commenters: The solution in this question is based on the answer OP got on [this other question](https://stackoverflow.com/questions/46880031/c-sharp-comparing-list-of-ids). It’s imho really unfortunate that this became the accepted solution there. — poke, Nov 19 '17 at 15:43

score 1 · Answer 1 · answered Nov 19 '17 at 14:59

every keyword has 20 IDs, so when I want to add a new Keyword to the HashSet, the KeywordComparer check that the new one does not have more than 8 ID's repeated with any keyword of the HashSet.In such case, new keyword is not included, if not, it's included.

Collecting these keywords is not a job for a hash set here. A hash set is generally not suited for items which depend on other elements of the set. You should only use it for things where a useful hash can be calculated for every item. Since it depends on the existing set of items whether a new item gets added to your set, this is totally the wrong tool.

Here’s an attempt to solve this problem according to your short description of what you actually want to do. Here, we are simply collecting the keywords in a list. In order to verify that they may be added, we use an addition hash set to collect the ids of the keywords. That way, we can quickly check for a new item, whether 8 or more of its ids are already contained within the list of keywords.

var keywords = new List<Keyword>();
var selectedIds = new HashSet<int>(); // I’m assuming that the ids are ints here

foreach (var keyword in GetListOfAllKeywords())
{
    // count the number of keyword ids that are already in the selectedIds set
    var duplicateIdCount = keyword.Ids.Count(id => selectedIds.Contains(id));

    if (duplicateIdCount <= 8)
    {
        // less or equal to 8 ids are already selected, so add this keyword
        keywords.Add(keyword);

        // and collect all the keyword’s ids
        selectedIds.AddRange(keyword.Ids);
    }
}

rene · Accepted Answer · 2017-11-19T16:50:54.377

If I stay away from the fact if using the HashSet is the right type for the job at hand or if your Comparer even makes sense implementing a proper GetHashCode does seem to make a huge difference.

Here is an example implementation, based on an answer from Marc Gravell:

class KeyWordComparer : EqualityComparer<Keyword>
{
    // omitted your Equals implentaton for brevity

    public override int GetHashCode(Keyword keyword)
    {
        //return 0; // this was the original
        // Marc Gravell https://stackoverflow.com/a/371348/578411
        int hash = 13;
        // not sure what is up with the only 8 ID's but I take that as a given
        for(var i=0; i < Math.Min(keyword._lista_modelos.Count, 8) ; i++)
        {
            hash = (hash * 7) + keyword._lista_modelos[i].GetHashCode();
        }
        return hash;
    }
}

When I run this in LinqPad with this test rig

   Random randNum = new Random();
   var kc = new KeyWordComparer();
   HashSet<Keyword> set = new HashSet<Keyword>(kc);
   var sw = new Stopwatch();
   sw.Start();
   for(int i =0 ; i< 10000; i++)
   {
     set.Add(new Keyword(Enumerable
            .Repeat(0, randNum.Next(1,10))
            .Select(ir => randNum.Next(1, 256)).ToList()));
   }
   sw.Stop();
   sw.ElapsedMilliseconds.Dump("ms");

this is what I measure:

7 ms for 10,000 items

If I switch back to your return 0; implementation for GetHashCodeI measure

4754 ms for 10,000 items

If I increase the testloop to insert 100,000 items the better GetHashCode still completes in 224 ms on my box. I didn't wait for your implementation to finish.

So if anything implement a proper GetHashCode method.

c# HashSet init takes too long

2 Answers2

Linked