How to create a HashSet> with distinct elements?

Question

I have a HashSet that contains multiple lists of integers - i.e. HashSet<List<int>>

In order to maintain uniqueness I am currently having to do two things: 1. Manually loop though existing lists, looking for duplicates using SequenceEquals. 2. Sorting the individual lists so that SequenceEquals works currently.

Is there a better way to do this? Is there an existing IEqualityComparer that I can provide to the HashSet so that HashSet.Add() can automatically handle uniqueness?

var hashSet = new HashSet<List<int>>();

for(/* some condition */)
{
    List<int> list = new List<int>();

    ...

    /* for eliminating duplicate lists */

    list.Sort();

    foreach(var set in hashSet)
    {
        if (list.SequenceEqual(set))
        {
            validPartition = false;
            break;
        }
    }

    if (validPartition)
           newHashSet.Add(list);
}

Check Jon Skeet's answer here: http://stackoverflow.com/questions/1023424/how-to-use-comparer-for-a-hashset/1023475#1023475 — Forgotten Semicolon, Apr 01 '11 at 19:34
Could you provide more info on what problem you are actually trying to solve? `HashSet>` seems like an unlikely tool to use. — marcind, Apr 01 '11 at 20:19
@marcind, I am using it to maintain a list of all factorizations of a number.. so for 24 you can have for example {4, 2, 3} , {2, 2, 6} etc... the algorithm I use at the moment creates duplicate sets, I wish I knew how to solve THAT problem, but sadly I dont :-/ — Preets, Apr 01 '11 at 20:51
You might want to ask that as a separate question. There should be a more elegant solution than what you're currently trying. — CodesInChaos, Apr 01 '11 at 20:55
@CodeInChaos, yup, I definitely think I should ! Any solution will be more elegant than the mess I've got going at the moment ;-) — Preets, Apr 01 '11 at 20:59
@Preets - this is a bit off topic, but what namespace are you getting HasSet from. I would think it would be in System.Collections.Generic but I have `using System.Collections.Generic` and which I cna use List, it is yelling at me for using HashSet... — kralco626, Apr 07 '11 at 11:23
@kralco626, I am using System.Collections.Generic.HashSet. I am afraid I don't quite get your question.. is the compiler asking you to use a List instead of a HashSet ? — Preets, Apr 07 '11 at 20:46
@Preets - no, it's just trying to tell me that HashSet does not exsist in System.Collections.Generic... — kralco626, Apr 08 '11 at 10:52

score 8 · Accepted Answer · edited Jan 11 '22 at 10:24

8

Here is a possible comparer that compares an IEnumerable<T> by its elements. You still need to sort manually before adding.

One could build the sorting into the comparer, but I don't think that's a wise choice. Adding a canonical form of the list seems wiser.

This code will only work in .net 4 since it takes advantage of generic variance. If you need earlier versions you need to either replace IEnumerable with List, or add a second generic parameter for the collection type.

class SequenceComparer<T>:IEqualityComparer<IEnumerable<T>>
{
    public bool Equals(IEnumerable<T> seq1,IEnumerable<T> seq2)
    {
        return seq1.SequenceEqual(seq2);
    }
    
    public int GetHashCode(IEnumerable<T> seq)
    {
        int hash = 1234567;
        foreach(T elem in seq)
            hash = unchecked(hash * 37 + elem.GetHashCode());
        return hash;
    }
}

void Main()
{
    var hashSet = new HashSet<List<int>>(new SequenceComparer<int>());

    List<int> test=new int[]{1,3,2}.ToList();
    test.Sort();
    hashSet.Add(test);

    List<int> test2=new int[]{3,2,1}.ToList();
    test2.Sort();       
    hashSet.Contains(test2).Dump();
}

edited Jan 11 '22 at 10:24

Zoe

27,060
21
118
148

answered Apr 01 '11 at 20:29

CodesInChaos

106,488
23
218
262

@downvoter Could you explain what the problem with this solution is, so I can fix/improve it? – CodesInChaos Apr 01 '11 at 20:38
(not my downvote) Pretty close but missing the sort either before the add or in Equals – Rune FS Apr 01 '11 at 20:39
That's why I explained in the beginning that he still needs to sort manually. – CodesInChaos Apr 01 '11 at 20:41
@CodeInChaose I agree with the design choice you outline. Seems a waste to sorte all the already sorted lists. It was a comment to your (now updated) Main(). (and to me that wasn't worth downvoting for in the first place) – Rune FS Apr 01 '11 at 20:47
Having the `.Sort()` in `Main` illustrates better how to use it. So that was a good suggestion :) – CodesInChaos Apr 01 '11 at 20:50
@CodeInChaos Thanks ! This worked just fine. Though I am a bit confused about the Equals implementation. Not sure if I am making sense, but why don't we simply compare hash codes in equals ? Is calling SequenceEqual in Equals() essentially the same ? – Preets Apr 01 '11 at 21:41
@CodeInChaose The sort is worst case O(n2) the Equals is worst case O(n) and the GetHashCode always O(n), is this really the best way? – Magnus Apr 01 '11 at 21:52
@Magus The sort is O(n*log(n)) average case, and you can't get below O(n) when adding an arbitrary list anyways. The hash code will typically calculated once, and the equals will be calculated as often as the hash code collides inside the table, which is O(1). There are some possible improvements, but I doubt they're necessary in practice. I believe this is a good compromise between code readability/size and performance. – CodesInChaos Apr 01 '11 at 22:03
Going with this one, as maintaining a new data structure for storing hash codes (though more efficient) seems like an overkill for my application – Preets Apr 01 '11 at 22:55
@CodeInChaos - I had just assumed that when adding something to a HashSet, each existing element's hashcode would be checked directly when you add something else. That is not the case (I confirmed with a quick test) so you're correct when you say each hash code should only be checked once. So basically my answer is overkill. At the same time it's interesting, and a little troubling, since it means a `HashSet` can easily be corrupted - changing a member from an external reference, then inserting a member with a hashcode that matches the old one for the member you changed will fail. – Jamie Treworgy Apr 01 '11 at 22:55
The contract of a hashset/dictionary explicitly states that neither the equality nor the hashcode may change while an object is in a set. Typically you only override equals/hashcode on immutable objects. And even if hashset didn't store the hash it would be corrupted by a changing hash, since the hash determines the bucket, so with a changed hash it's look into the wrong bucket. – CodesInChaos Apr 02 '11 at 07:54
Hi, how can i make this to work with double? i need this! – Dang D. Khanh Jul 13 '20 at 06:06

score 7 · Answer 2 · answered Apr 01 '11 at 20:26

7

This starts off wrong, it has to be a HashSet<ReadOnlyCollection<>> because you cannot allow the lists to change and invalidate the set predicate. This then allows you to calculate a hash code in O(n) when you add the collection to the set. And an O(n) test to check if it is already in the set with a very uncommon O(n^2) worst case if all the hashes turn out to be equal. Store the computed hash with the collection.

answered Apr 01 '11 at 20:26

Hans Passant

922,412
146
1,693
2,536

It's not like `ReadOnlyCollection` guarantees immutability. And if this set isn't exposed in a public API mutability doesn't matter. Storing the computed hash isn't that important either, since I think `HashSet` already stores the hashes for the elements it already contains. – CodesInChaos Apr 01 '11 at 20:34
A ReadOnlyCollection does. Whether to store it or to create a derived class that overrides Equals+GetHashCode is up to the OP. – Hans Passant Apr 01 '11 at 20:41
1

What I meant is that if you don't create the ReadOnlyCollection yourself an outsider still has the reference to the underlying IList and can change that list, which is then reflected in the ReadOnlyCollection. If you control the creation of the ReadOnlyCollection you can guarantee (shallow) immutability. (And on int deep immutability) – CodesInChaos Apr 01 '11 at 20:46

Jamie Treworgy · Answer 3 · 2011-04-01T21:14:00.033

Is there a reason you aren't just using an array? int[] will perform better. Also I assume the lists contain duplicates, otherwise you'd just be using sets and not have a problem.

It appears that their contents won't change (much) once they've been added to the HashSet. At the end of the day, you are going to have to use a comparer that falls back on SequenceEqual. But you don't have to do it every single time. Instead or doing an exponential number of sequence compares (e.g. -- as the hashset grows, doing a SequenceEqual against each existing member) -- if you create a good hashcode up front, you may have to do very few such compares. While the overhead of generating a good hashcode is probably about the same as doing a SequenceEqual you're only doing it a single time for each list.

So, the first time you operate on a particular List<int>, you should generate a hash based on the ordered sequence of numbers and cache it. Then the next time the list is compared, the cached value can be used. I'm not sure how you might do this with a comparer off the top of my head (maybe a static dictionary?) -- but you could implement List wrapper that does this easily.

Here's a basic idea. You'd need to be careful to ensure that it isn't brittle (e.g. make sure you void any cached hash code when members change) but it doesn't look like that's going to be a typical situation for the way you're using this.

public class FasterComparingList<T>: IList<T>, IList, ... 
    /// whatever you need to implement
{
   // Implement your interfaces against InnerList
   // Any methods that change members of the list need to
   // set _LongHash=null to force it to be regenerated
   public List<T> InnerList { ... lazy load a List }
   public int GetHashCode()
   {
       if (_LongHash==null) {
           _LongHash=GetLongHash();
       }
       return (int)_LongHash;
   }
   private int? _LongHash=null;
   public bool Equals(FasterComparingList<T> list)
   {
       if (InnerList.Count==list.Count) {
           return true;
       }
       // you could also cache the sorted state and skip this if a list hasn't
       // changed since the last sort
       // not sure if native `List` does
       list.Sort();
       InnerList.Sort();
       return InnerList.SequenceEqual(list);
   }
   protected int GetLongHash()
   {
       return .....
       // something to create a reasonably good hash code -- which depends on the 
       // data. Adding all the numbers is probably fine, even if it fails a couple 
       // percent of the time you're still orders of magnitude ahead of sequence
       // compare each time
   } 
}

If the lists won't change once added, this should be very fast. Even in situations where the lists could change frequently, the time to create a new hash code is not likely very different (if even greater at all) than doing a sequence compare.

No particular reason to use List<>, I did not know about int[]s performing better. Thanks! And your assumption is correct, the lists do include duplicates, which is why I do not use sets. — Preets, Apr 01 '11 at 22:28
Generally a simpler construct is probably going to be faster than a more complex one, unless you're doing something that depends on some aspect of that complex structure (e.g. a linked list will be much faster to insert items than a non-linked list). The long and short of my long winded response is that you should use a construct that can cache hash codes. Since it's expensive to compare lists or create something that can uniquely identify one, and you're doing it lots of times on the same objects, just set something up that will remember that unique ID. — Jamie Treworgy, Apr 01 '11 at 22:34

BrandonZeider · Answer 4 · 2011-04-01T20:14:00.163

0

If you don't specify an IEQualityComparer, then the types default will be used, so I think what you'll need to do is create your own implementation of IEQualityComparer, and pass that to the constructor of your HashSet. Here is a good example.

edited Apr 01 '11 at 20:14

answered Apr 01 '11 at 19:38

BrandonZeider

8,014
2
23
20

score 0 · Answer 5 · answered Oct 30 '19 at 20:58

When comparing hashsets of lists one option you always have is that instead of comparing each element, you sort lists and join them using a comma and compare generated strings.

So, in this case, when you create custom comparer instead of iterating over elements and calculating custom hash function, you can apply this logic.

How to create a HashSet> with distinct elements?

5 Answers5

Linked

Related