I'm doing a lot of work at the moment with greedy algorithms - they don't care about indexing etc, they only work on groups/sets. But I'm finding that 85% of my execution time is spent trying to pick an item out of a HashSet.
According to MSDN docs:
The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
...and that appears to be true for everything EXCEPT "fetching a value".
I've tried:
- ElementAt(0) - extremely slow, allegedly because HashSet goes and generates an adhoc ordering, sorts everything, then returns whatever was first
- First - extremely slow (presumably it's doing the same thing)
What I expected:
- AnyItem() - returns an item from the HashSet, with no guarantees. Could be first, could be last, could be random (but don't count on that).
...but I can't seem to figure out a way of doing that.
EDIT2: slightly more detailed source code, to make it clear why some trivial workarounds for this missing feature in HashSet don't help, and hopefully to show better why HashSet is the right class in all other ways:
HashSet<MyClass> candidates;
HashSet<MyClass> unvisited;
... // fill unvisited with typically: 100,000 to 10 million items
... // fill candidates with at least 1 item, potentially 100,000's of items
while( candidates.Count > 0 && unvisited.Count > 0 )
{
var anyItem = candidates.First();
while( ! CanProcess( anyItem ) ) // CanProcess probable checks some set intersections
{
candidates.Remove( anyItem );
if( candidates.Count > 0 )
anyItem = candidates.First();
else
{
anyItem = null;
break;
}
}
if( anyItem == null ) // we've run out of candidates
break;
// For the algorithm: "processing" anyItem has a side-effect of
// transferring 0 or more items from "unvisited" into "candidates"
var extraCandidates = Process( anyItem, unvisited );
// ... Process probably does some set intersections
... // add all the extraCandidates to candidates
... // remove all the extraCandidates from unvisited
candidates.Remove( anyItem );
}
i.e.: Typical greedy algorithm that has several sets: one set of "starting points for next iteration", and one or more sets (here I've only shown one) of "data that hasn't been processed yet, and is somehow connected to / reachable from the starting points".
...everything there is fast, the only thing that's slow is the "First" call - and I have no reason to take the first, I could take any, but I need to take something!