42

A few days ago, I answered an interesting question on SO about HashSet<T>. A possible solution involved cloning the hashset, and in my answer I suggested doing something like this:

HashSet<int> original = ...
HashSet<int> clone = new HashSet<int>(original);

Although this approach is quite straightforward, I suspect it's very inefficient: the constructor of the new HashSet<T> needs to add each item from the original hashset separately, checking every time whether the item is already present. This is clearly a waste of time: since the source collection is an ISet<T>, it is guaranteed not to contain duplicates. There should be a way to take advantage of that knowledge...
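Conceptually, the copy constructor's behavior amounts to the following sketch (NaiveClone is a hypothetical name; the real constructor goes through an internal add path, but the cost profile is the same):

```csharp
using System.Collections.Generic;

// Roughly what the HashSet<T>(IEnumerable<T>) constructor does: every
// element goes through the full duplicate check, even though a set
// cannot contain duplicates in the first place.
static HashSet<int> NaiveClone(HashSet<int> original)
{
    var clone = new HashSet<int>();
    foreach (var item in original)
        clone.Add(item); // hashes the item and probes the buckets for a match
    return clone;
}
```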

Ideally, HashSet<T> would implement ICloneable, but unfortunately that's not the case. I also checked with Reflector to see if the HashSet<T> constructor does anything specific when the source collection is a hashset, but it doesn't. It could probably be done by using reflection on private fields, but that would be an ugly hack...

So, has anyone come up with a clever solution to clone a hashset more efficiently?

(Note that this question is purely theoretical; I don't need to do that in a real program.)

Thomas Levesque
  • hm, good question, just curious though: what are the theoretical inefficiencies we are concerned about? I'm rusty on my order notation for abstract data types, but wouldn't a check for existence within the target hash set be a simple O(1) collision test? I agree from an informational perspective it could be "better", but can we put a bound on it, and would it be significant? – johnny g Oct 13 '10 at 20:45
  • 2
    I suspect the reason they don't have a HashSet(ISet) constructor is that any class could implement ISet, perhaps badly; which means the presence of ISet is no guarantee that there are no duplicates – Steve Ellinger Oct 13 '10 at 20:46
  • 2
    @Steve Ellinger, you're probably right. However, they could have provided a HashSet(HashSet) constructor... – Thomas Levesque Oct 13 '10 at 21:00
  • Actually, what I am curious about is why they didn't implement ICloneable. Is it because any implementation would be no more efficient than the constructor you ended up calling in your referred-to answer, and therefore why bother when the functionality is already available? The same could possibly be said for your copy constructor. Of course, this doesn't seem plausible given your comment above about 'and check if it isn't already present'. Hmmm. – Steve Ellinger Oct 13 '10 at 21:08
  • 3
    Even the deserializer makes no assumptions and uses AddIfNotPresent(). Good idea, the culture might have changed. This is a no-go. Question the need to clone first. Expensive operations should be, well, expensive. Great API design. – Hans Passant Oct 13 '10 at 21:12
  • Would you happen to know of any penalties of using serialization in general? I tested this comparing using the constructor vs serializing and the constructor was nearly 2x faster on average without verification (3x with verification) on a 10000 item set. With larger sets, the difference is reduced with constructor still faster. I can post the code if you'd like. – Jeff Mercado Oct 13 '10 at 21:17
  • @johnny g, I just did a small test: clone using reflection vs. constructor call. Even with the overhead it implies, reflection is roughly twice as fast. So I guess a real Clone method would be much faster... – Thomas Levesque Oct 13 '10 at 21:23
  • @Hans Passant, good remarks. Regarding the need to clone a hashset: as I said, the question is purely theoretical, I don't need to do it. – Thomas Levesque Oct 13 '10 at 21:24
  • @Jeff M, the idea of using serialization crossed my mind, but it means that (1) the items must be serializable, and (2) each item will be cloned... My idea was to make a shallow copy of the hashset, not a deep copy. – Thomas Levesque Oct 13 '10 at 21:26
  • They didn't implement ICloneable because it's a lousy interface. http://blogs.msdn.com/b/brada/archive/2003/04/09/49935.aspx is one reason. – Bill Wert - MSFT Oct 14 '10 at 23:40
  • Why would you think they don't have some efficient way of implementing the clone via the constructor? If I was writing the code, I'd check the input, and if I knew the object passed was a HashSet, then I could write a specific, faster method to do the copy, since I can infer information from this and also access the privates of the other input HashSet; otherwise just do a 'slow' copy – jasper Oct 29 '10 at 21:40
  • @jasper: I checked the code with Reflector, there is no such optimization. – Thomas Levesque Oct 29 '10 at 21:58
  • @Thomas Levesque: not a surprise I suppose, but it's certainly an optimization that could be done. As you said above, it's fairly inefficient to recompute something you already have access to, and I can't think of any reason not to. Then again, if you're going to do that, it would be better to provide a HashSet(HashSet) constructor as you said. – jasper Oct 30 '10 at 02:15

6 Answers

13

If you really wanted the most efficient way to clone a HashSet<T>, you'd do the following (but possibly at the cost of maintainability):

  1. Use Reflector or the debugger to figure out exactly which fields in HashSet<T> need to be copied. You may need to do this recursively for each field.
  2. Use Reflection.Emit or expression trees to generate a method which does the necessary copying of all of the fields. It may need to call other generated methods which copy the value of each field. We're using runtime code generation because it's the only way to directly access private fields.
  3. Use FormatterServices.GetUninitializedObject(...) to instantiate a blank object. Use the method generated in step 2 to copy the original object to the new blank object.
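The steps above can be approximated without code generation. The following is a plain-reflection sketch (CloneViaFields is a hypothetical helper; a generated Reflection.Emit or expression-tree version of the same copy loop would amortize the per-field reflection cost, as step 2 suggests):

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

// Plain-reflection sketch of the approach above (no Reflection.Emit):
// allocate a blank object, then copy every instance field across,
// cloning arrays so the two sets don't share buckets/slots.
static T CloneViaFields<T>(T source) where T : class
{
    var clone = (T)FormatterServices.GetUninitializedObject(typeof(T));
    foreach (var field in typeof(T).GetFields(
        BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic))
    {
        var value = field.GetValue(source);
        if (value is Array array)   // internal buckets/slots arrays must
            value = array.Clone();  // be copied, not shared
        field.SetValue(clone, value);
    }
    return clone;
}
```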
BartoszKP
jthg
  • Forgot to mention (the obvious optimization) that you'd want to cache the generated method and reuse it for all cloning operations. – jthg Nov 04 '10 at 18:52
  • Oops, missed the part where you called this an "ugly hack". It shouldn't be _too_ ugly if you use expression trees rather than Reflection.Emit. Of course, the tight dependency on the implementation details of HashSet might make it ugly if MS decides to tweak HashSet. – jthg Nov 04 '10 at 18:58
  • I don't like the idea of reflection on private members... but unless Microsoft implements a proper copy constructor, I agree that it's probably the most efficient way to do it. – Thomas Levesque Nov 04 '10 at 20:05
  • Since the source code [is now available](http://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs,831ca3e6d9c0d78b), I was wondering if someone has an implementation which constructs with only an existing HashSet of the same type, initializes with original capacity and doesn't use `AddIfNotPresent` to add. I don't get why nothing has been done about this. – Chibueze Opata Sep 14 '15 at 16:08
  • @jthg I've never seen someone use `FormatterServices.GetUninitializedObject(...)` for something other than serialization... epic usage of a little known method! – mhand Mar 17 '16 at 16:13
10

I checked the .NET Framework source code for both version 4.5.2 and version 4.7.2. Version 4.7.2 does have an optimization in the constructor to handle the case where the passed-in collection is of type HashSet<T>, using some internal cloning logic. You also need to pass the comparer into the constructor for this logic to kick in. Version 4.5.2 does NOT appear to have this optimization.

Example:

var clonedSet = new HashSet<int>(set, set.Comparer);
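A fuller sketch of that call, using a case-insensitive comparer to show why passing set.Comparer matters (on runtimes that have the optimization, the constructor can clone the internals directly):

```csharp
using System;
using System.Collections.Generic;

var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Apple", "banana" };

// Passing the source's own comparer is what lets the optimized
// constructor recognize a compatible HashSet<string> and copy it.
var clonedSet = new HashSet<string>(set, set.Comparer);

Console.WriteLine(clonedSet.SetEquals(set));      // True
Console.WriteLine(clonedSet.Contains("APPLE"));   // True: comparer was preserved
```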
Mafu Josh
2

EDIT: After closer inspection, this does not seem to be a good idea: with fewer than 60 elements in the original hashset, the method below appears to be slower than just creating a new hashset.

DISCLAIMER: this seems to work, but use at your own risk; if you are going to serialize the cloned hashsets, you probably want to copy SerializationInfo m_siInfo as well.

I also faced this problem and took a stab at it. Below you will find an extension method that uses FieldInfo.GetValue and SetValue to copy the required fields. It is faster than using HashSet(IEnumerable); how much faster depends on the number of elements in the original hashset. For 1,000 elements the difference is about a factor of 7; with 100,000 elements it's about a factor of 3.

There are other ways which may be even faster, but this has gotten rid of the bottleneck for me for now. I tried using expression trees and emitting IL but hit a roadblock; if I get those to work I'll update this post.

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

public static class HashSetExtensions
{
    public static HashSet<T> Clone<T>(this HashSet<T> original)
    {
        // Allocate a HashSet<T> without running any constructor,
        // then copy the internal fields over from the original.
        var clone = (HashSet<T>)FormatterServices.GetUninitializedObject(typeof(HashSet<T>));
        Copy(Fields<T>.comparer, original, clone);

        if (original.Count == 0)
        {
            Fields<T>.freeList.SetValue(clone, -1);
        }
        else
        {
            Fields<T>.count.SetValue(clone, original.Count);
            Clone(Fields<T>.buckets, original, clone);
            Clone(Fields<T>.slots, original, clone);
            Copy(Fields<T>.freeList, original, clone);
            Copy(Fields<T>.lastIndex, original, clone);
            Copy(Fields<T>.version, original, clone);
        }

        return clone;
    }

    static void Copy<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, field.GetValue(source));
    }

    static void Clone<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, ((Array)field.GetValue(source)).Clone());
    }

    // Cached FieldInfo for HashSet<T>'s private fields
    // (names match the .NET Framework 4.x implementation).
    static class Fields<T>
    {
        public static readonly FieldInfo freeList = GetFieldInfo("m_freeList");
        public static readonly FieldInfo buckets = GetFieldInfo("m_buckets");
        public static readonly FieldInfo slots = GetFieldInfo("m_slots");
        public static readonly FieldInfo count = GetFieldInfo("m_count");
        public static readonly FieldInfo lastIndex = GetFieldInfo("m_lastIndex");
        public static readonly FieldInfo version = GetFieldInfo("m_version");
        public static readonly FieldInfo comparer = GetFieldInfo("m_comparer");

        static FieldInfo GetFieldInfo(string name)
        {
            return typeof(HashSet<T>).GetField(name, BindingFlags.Instance | BindingFlags.NonPublic);
        }
    }
}
Bas Smit
0

An easy pattern which, it turns out, won't work for many collections:

Class cloneableDictionary(Of T, U)
    Inherits Dictionary(Of T, U)
    Function clone() As Dictionary(Of T, U)
        Return CType(Me.MemberwiseClone, cloneableDictionary(Of T, U))
    End Function
End Class

Unfortunately, I don't know whether Microsoft did anything to prevent calling MemberwiseClone in places where it shouldn't be called (e.g. declaring something other than a method--like maybe a class--with the name MemberwiseClone), so I don't know how one can tell whether such an approach is likely to work.

I think there's a fair reason for a standard collection to support only a protected cloning method rather than a public one: it's possible that a class which derives from a collection might break severely if cloned, and if the base class's cloning method is public, there's no way to prevent an object of a derived class from being given to code that expects to clone it.

That having been said, it would have been nice if .NET included cloneableDictionary and other such classes as standard types (though obviously not implemented essentially as above).
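As the first comment below explains, the problem is that MemberwiseClone copies the internal array *references*, not the arrays themselves. A minimal C# demonstration with a hypothetical array-backed container:

```csharp
using System;

var a = new Bag();
var b = a.ShallowClone();
b[0] = 42;
Console.WriteLine(a[0]); // prints 42: both "clones" share the same items array

// Hypothetical array-backed container standing in for a real collection.
class Bag
{
    private readonly int[] items = new int[4];
    public int this[int i] { get => items[i]; set => items[i] = value; }
    public Bag ShallowClone() => (Bag)MemberwiseClone(); // copies the array reference only
}
```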

supercat
  • 1
    This won't work... It does a shallow copy, which is what I (kind of) want, but it's *too* shallow: most collections internally use arrays to store the items and/or buckets, and MemberwiseClone will create a copy of the collection *with the same array instance*. So the clones won't be independent copies: if I modify one of the collections, the other one will be affected too, and will become corrupted, which is worse! – Thomas Levesque Oct 15 '10 at 22:53
  • Note edits above. Probably worth keeping as an answer, to dissuade anyone else who might come up with the same "solution". BTW, it's too bad Microsoft didn't make "BaseClone" a protected method whose default implementation would be a memberwise clone, and define a standard means for disabling it (e.g. shadowing it with something else called BaseClone that isn't a method). – supercat Oct 15 '10 at 23:09
  • @Thomas Levesque: Really an embarrassing mistake, especially since I was just trying to figure out the right pattern for clonable objects. As soon as I saw your first post, I knew immediately that I'd oopsed. It seems a lot of people seem to prefer the notion of a copy constructor, but in general a copy constructor is a poor substitute for a clone method, since the type of the object created by a copy constructor will be the type of the constructor, rather than the type of the object being copied. Maybe I'll post my proposed cloning pattern to a blog and link to it. – supercat Oct 16 '10 at 01:11
  • @Thomas Levesque: How do you like the cloning pattern at http://supercatnet.blogspot.com/2010/10/thoughts-about-object-cloning.html ? The method for sealing methods descendant classes shouldn't call seems a little icky, but workable; is there a better method? Is there any way for a derived class to mess things up without using Reflection? Should I post that pattern as an "is this a good pattern" question? – supercat Oct 16 '10 at 02:32
-1

An O(n) clone is as good as it gets, theoretically, when the clone must not share the underlying data structure with the original.

Checking whether or not an element is in a HashSet should be a constant time (i.e. O(1)) operation.

So you could create a wrapper that would just wrap an existing HashSet and hold on to any new additions, but that seems pretty perverse.

When you say 'efficient', you mean 'more efficient than the existing O(n) method'. I posit you can't actually get more efficient than O(n) without playing pretty serious semantic games about what 'clone' means.
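The wrapper idea could be sketched like this (a hypothetical LayeredSet, deliberately incomplete: removals and enumeration are ignored, and the base set must not change afterwards):

```csharp
using System.Collections.Generic;

// Hypothetical "lazy clone": O(1) to create, at the cost of every
// lookup touching two sets. Removals are deliberately not handled.
class LayeredSet<T>
{
    private readonly HashSet<T> baseSet;                      // shared, treated as read-only
    private readonly HashSet<T> additions = new HashSet<T>(); // writes land here

    public LayeredSet(HashSet<T> baseSet) { this.baseSet = baseSet; }

    public bool Contains(T item) => baseSet.Contains(item) || additions.Contains(item);
    public bool Add(T item) => !Contains(item) && additions.Add(item);
}
```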

Ari Gesher
  • 2
    No, when I say "efficient", I don't mean a better complexity. You're correct in saying that it will be a O(n) operation anyway, but it isn't only about complexity. Consider this: `List.Add` has a O(1) complexity, like `HashSet.Add`, but it's much *faster* because it doesn't need to check if the item is already present. So when I say "efficient", I mean faster, not less complex. – Thomas Levesque Nov 03 '10 at 19:13
  • It's not O(n). It's O(n log n). Every time they insert new stuff, they have to check whether the stuff already exists. – user4951 Dec 13 '22 at 06:03
-3

Just a random thought. It might be silly.

Since they did not implement ICloneable, and the constructor does not use the knowledge that the source is of the same type, I guess we're left with one option: implementing the optimized version ourselves and adding it as an extension method to the type.

Something like:

namespace ExtensionMethods
{
    public static class MyExtensions
    {
        public static HashSet<int> Clone(this HashSet<int> original)
        {
            HashSet<int> clone = new HashSet<int>();
            //your optimized code here 
            return clone;
        }
    }   
}

Then, your code from the question would look like this:

HashSet<int> original = ...
HashSet<int> clone = original.Clone();
Sergiu Damian