
I'm doing a lot of work at the moment with greedy algorithms - they don't care about indexing etc, they only work on groups/sets. But I'm finding that 85% of my execution time is spent trying to pick an item out of a HashSet.

According to MSDN docs:

The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

...and that appears to be true for everything EXCEPT "fetching a value".

I've tried:

  • ElementAt(0) - extremely slow, allegedly because HashSet goes and generates an ad hoc ordering, sorts everything, then returns whatever was first
  • First - extremely slow (presumably it's doing the same thing)

What I expected:

  • AnyItem() - returns an item from the HashSet, with no guarantees. Could be first, could be last, could be random (but don't count on that).

...but I can't seem to figure out a way of doing that.
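
The obvious workaround would be an extension method along the lines of the sketch below (purely hypothetical - there is nothing like it in the framework), but it's really just First() in disguise: it still has to create the enumerator and scan the internal slots, so it pays exactly the same cost on a sparse set:

using System;
using System.Collections.Generic;

// Hypothetical extension method - NOT part of the BCL. Returns "some item" with
// no ordering guarantees, but it still creates the enumerator and scans the
// internal slots until it finds a non-empty one, exactly like First() does.
public static class HashSetExtensions
{
    public static T AnyItem<T>(this HashSet<T> set)
    {
        foreach (var item in set)
            return item;
        throw new InvalidOperationException("The set is empty.");
    }
}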

EDIT2: slightly more detailed source code, to make it clear why some trivial workarounds for this missing feature in HashSet don't help, and hopefully to show better why HashSet is the right class in all other ways:

HashSet<MyClass> candidates;
HashSet<MyClass> unvisited;

... // fill unvisited with typically: 100,000 to 10 million items
... // fill candidates with at least 1 item, potentially 100,000's of items

while( candidates.Count > 0 && unvisited.Count > 0 )
{
  var anyItem = candidates.First();

  while( ! CanProcess( anyItem ) ) // CanProcess probably checks some set intersections
  {
     candidates.Remove( anyItem );
     if( candidates.Count > 0 )
        anyItem = candidates.First();
     else
     {
        anyItem = null;
        break;
     }
  }

  if( anyItem == null ) // we've run out of candidates
     break;

  // For the algorithm: "processing" anyItem has a side-effect of 
  // transferring 0 or more items from "unvisited" into "candidates"
  var extraCandidates = Process( anyItem, unvisited );
  // ... Process probably does some set intersections
  
  ... // add all the extraCandidates to candidates
  ... // remove all the extraCandidates from unvisited
  
  candidates.Remove( anyItem );
}

i.e.: Typical greedy algorithm that has several sets: one set of "starting points for next iteration", and one or more sets (here I've only shown one) of "data that hasn't been processed yet, and is somehow connected to / reachable from the starting points".

...everything there is fast, the only thing that's slow is the "First" call - and I have no reason to take the first, I could take any, but I need to take something!

Adam
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/222470/discussion-on-question-by-adam-picking-any-item-from-a-hashset-is-very-slow). – Samuel Liew Oct 04 '20 at 05:33

2 Answers


It seems that the HashSet class is not optimized for the scenario where its first item is repeatedly removed. The space reserved internally by this class is not reduced after each removal; instead the corresponding slot is marked as empty. The enumerator of the class walks all the internal slots and yields a value whenever it finds a non-empty slot. This can become extremely inefficient when the internal space of a HashSet has become sparsely populated. For example, a HashSet that once held 1,000,000 elements and has been reduced to a single element must enumerate 1,000,000 slots before yielding the element stored in its single non-empty slot:

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
set.ExceptWith(Enumerable.Range(1, 999_999));
var item = set.First(); // Very slow

This is a problem that is not easy to solve. One solution is to call the TrimExcess method after each batch of deletions. This method minimizes the space reserved internally by the class, by allocating a new array of slots, copying the items from the existing array to the new one, and finally discarding the old array. This is an expensive operation, so calling TrimExcess too frequently could become the new bottleneck of your app.
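
For example, a minimal sketch (the numbers are made up; how often to trim is exactly the tuning trade-off mentioned above):

var candidates = new HashSet<int>(Enumerable.Range(1, 1_000_000));

// A batch of deletions leaves the internal slot array sparsely populated.
candidates.ExceptWith(Enumerable.Range(1, 999_000));

// Compact the internal storage once per batch, not after every single removal.
candidates.TrimExcess();

var item = candidates.First(); // Fast again, because the slot array is dense once more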

Another solution could be to use a third-party implementation that doesn't suffer from this problem. For example, the Rock.Collections library contains an OrderedHashSet class that keeps the items in the order in which they are added. It achieves this by connecting the internal slots in a linked-list manner. The class can be enumerated not only in normal but also in reversed order. I haven't tested it, but calling First on it should most likely be an O(1) operation.

Below is a solution that uses reflection to trick the built-in enumerator into starting the enumeration of the slots from a random index, instead of index 0. It offers acceptable performance, but it suffers from the known problems of reflection (overhead, forward compatibility, etc.). The static GetRandom method is located inside a generic static class, in order to cache the FieldInfo information separately for each type T.

using System;
using System.Collections.Generic;
using System.Reflection;
using System.Threading;

public static class HashSetRandomizer<T>
{
    private static FieldInfo _lastIndexField;
    private static FieldInfo _indexField;
    private static ThreadLocal<Random> _random;

    static HashSetRandomizer()
    {
        const BindingFlags FLAGS = BindingFlags.NonPublic | BindingFlags.Instance;
        _lastIndexField = typeof(HashSet<T>).GetField("m_lastIndex", FLAGS) // Framework
            ?? typeof(HashSet<T>).GetField("_lastIndex", FLAGS); // Core
        _indexField = typeof(HashSet<T>.Enumerator).GetField("index", FLAGS) // Framework
            ?? typeof(HashSet<T>.Enumerator).GetField("_index", FLAGS); // Core
        _random = new ThreadLocal<Random>(() => new Random());
    }

    public static T GetRandom(HashSet<T> source, Random random = null)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        random = random ?? _random.Value;
        if (_lastIndexField == null)
            throw new NotSupportedException("FieldInfo lastIndex not found.");
        if (_indexField == null)
            throw new NotSupportedException("FieldInfo index not found.");
        if (source.Count > 0)
        {
            int lastIndex = (int)_lastIndexField.GetValue(source);
            if (lastIndex > 0)
            {
                var randomIndex = random.Next(0, lastIndex);
                // HashSet<T>.Enumerator is a struct; box it once as IEnumerator<T>
                // so that the reflection call below mutates the same instance that
                // MoveNext is invoked on, rather than a temporary boxed copy.
                using (IEnumerator<T> enumerator = source.GetEnumerator())
                {
                    _indexField.SetValue(enumerator, randomIndex);
                    if (enumerator.MoveNext()) return enumerator.Current;
                }
            }
            foreach (var item in source) return item; // Fallback
        }
        throw new InvalidOperationException("The source sequence is empty.");
    }
}

Usage example: items are removed randomly from a HashSet until the set is empty.

var set = new HashSet<int>(Enumerable.Range(1, 1_000_000));
while (set.Count > 0)
{
    var item = HashSetRandomizer<int>.GetRandom(set); // Fast
    set.Remove(item);
}

Removing the last few items is still quite slow, even with this approach, because once the set has become very sparse most random starting indexes land in long runs of empty slots that the enumerator must still skip over before reaching a non-empty one.

Theodor Zoulias

Using First() all the time requires the internal Enumerator struct to be built every time. But since it doesn't matter which element you get, you can retrieve the enumerator only once and then keep reading data from it. Effectively you work through the entries with a normal foreach loop over the HashSet.

To prevent any "Collection was modified" exceptions you must not remove the processed entry from the HashSet until the iteration is complete. So you can collect the entries that have been processed and delete them afterwards. The source code might look like this:

HashSet<MyClass> hs; // approx 500,000 items

while( /* metadata based on what's been processed */ ) // might need adjusting, see below
{
    HashSet<MyClass> toDelete = new HashSet<MyClass>();
    foreach (MyClass entry in hs) // get the Enumerator only once, then iterate normally
    {
        if (ShouldProcess(entry))
        {
            Process(entry);
            toDelete.Add(entry); // remember processed entries, delete them later
        }
    }
    // finally delete them
    foreach (MyClass entry in toDelete)
    {
        hs.Remove(entry);
    }
}

Because you iterate over the whole HashSet instead of running your "metadata" check after each entry, you might need to adjust your outer while loop: in a single outer iteration the entire HashSet is traversed, and entries are not deleted immediately.

Progman
  • This would definitely help, in some cases significantly (although still a lot slower than if we had a .Any() method), but for this class of algorithms (greedy algorithms for image analysis, and for mesh manipulation - former works on 2d pixels/texels, latter works on 3D vertices) items tend to get removed in batches, which can be as small as 1 or as large as 10,000's. – Adam Oct 03 '20 at 17:32
  • I'm wondering if a cached "most recently added" list would do even better: I could tune the maximum number of entries to cache, and remove from it in parallel with removing from the hashset. Again: it's still worse than having an Any() method, but with tuning it should be able to get quite close in performance. But it feels strange that we're having to work so hard to emulate a method that is core for so many Set-based algorithms. – Adam Oct 03 '20 at 17:36
  • @Adam I thought about such a cache too, but then realized that you would run into the same problem again. When you have processed the "most recently added" entry, you remove it from the cache and the HashSet, but where do you get the next cache value from? You would have to fetch it from the HashSet, hitting the same performance issue we are trying to fix with the cache in the first place. – Progman Oct 03 '20 at 17:42
  • Most algorithms: this HashSet will grow and shrink during execution (it's a bag of candidates "that haven't been processed yet", and some processing will add new items to the bag). In those cases, I could dual add to both the set and to the cache. NB: when I asked the question, I thought I was making some simple oversight/mistake in not finding/using the correct "get me an item, I don't care which". From the responses it looks like there is no such simple thing, which converts the question into something more complex than I expected, sorry :( – Adam Oct 03 '20 at 19:03
  • This doesn't invalidate your answer, but to make it clearer I've expanded the source code example in the question to show a more complex case - it's the same core loop, but shows where set actions might happen, and where the HashSet can grow as well as shrink. – Adam Oct 03 '20 at 19:24
  • @Adam The problem is the creation of the [`HashSet.Enumerator`](https://docs.microsoft.com/en-us/dotnet/api/system.collections.generic.hashset-1.enumerator?view=netcore-3.1) every time you call `First()`, even though you only fetch one object. It's an overhead you should avoid. With this solution the ratio of `GetEnumerator()` calls to "fetching a value" becomes `1:N`, instead of `1:1` as when using `First()`. For your code you might have to save the new candidates in a list first, before you actually add them to the `candidates` set, to avoid "Collection was modified" exceptions. – Progman Oct 03 '20 at 19:34