14

I need to store a set of elements. What I need is functionality to

  1. remove (single) elements and
  2. add (sets of) elements and
  3. each object should only be in the set once and
  4. get a random element from the set

I chose the HashSet (C#) since it sports fast methods for removing elements (hashSet.remove(element)), adding sets (hashSet.UnionWith(anotherHashSet)) and the nature of a HashSet guarantees that there are not duplicates, so requirements 1 to 3 are taken care of.

The only way I found to get a random element is

Object object = hashSet.ElementAt(rnd.Next(hashSet.Count));

But this is very slow, since I call it once for every pixel of my map (creating a random flood fill from multiple starting points; mapsize 500x500 at the moment but I'd like to go bigger) and the hashset holds rather many items. (A quick test shows it blows up to 5752 entries before shrinking again.)

Profiling (CPU sampling) tells me my ElementAt calls take over 50%.

I realize 500x500 operations over a big hashset is no easy task, but other operations (Remove and UnionWith) are called as often as ElementAt, so the main problem seems to be the operation and not the number of calls.

I vaguely understand why getting a certain element from a HashSet is very expensive (when compared to getting it from a list or another ordered data structure, but I just want a random pick. Can it really be so hard and is there no way around it? Is there a better data structure for my purpose?

Changing everything to Lists doesn't help because now other methods become bottlenecks and it takes even longer.

Casting the HashSet to an array and pick my random element from there expectedly doesn't help because while picking a random element from an array is quick, casting the hashset to the array in the first place takes longer than running hashSet.ElementAt by itself.

If you want to understand better what I am trying to do: A link to my question and the answer.

Community
  • 1
  • 1
Christian Geese
  • 495
  • 6
  • 15
  • What are you removing? Is it only the randomly found element, or is it arbitrary? – spender May 27 '15 at 12:53
  • 2
    Why not do all your adding and removing with the HashSet, then before you want to do your random pixel-getting, just convert to a List once? Use that List, then throw away afterwards. Unless you need to be adding, removing and getting random elements at the same time... – Baldrick May 27 '15 at 12:53
  • @spender I remove the randomly found element only – Christian Geese May 27 '15 at 12:54
  • @Baldrick I fear it's the latter. The loop is basically: choose a random cell (the hashset holds all possible cells where the random floodfill can spread to, the "fringes") -> fill it -> find adjacent, empty cells and add them to the hashset -> remove filled cell from hashset -> Loop again till hashset is empty – Christian Geese May 27 '15 at 12:59
  • I feel that a two-dimensional linked list is going to be your friend here. – Enigmativity May 27 '15 at 13:12
  • Related: [Picking 'any' item from a HashSet is very slow - how to do this fast?](https://stackoverflow.com/questions/64186410/picking-any-item-from-a-hashset-is-very-slow-how-to-do-this-fast) – Theodor Zoulias Nov 10 '21 at 21:58

2 Answers2

8

I think that OrderedDictionary might suit your purposes:

var dict = new OrderedDictionary();

dict.Add("My String Key", "My String");
dict.Add(12345, 54321);

Console.WriteLine(dict[0]); // Prints "My String"
Console.WriteLine(dict[1]); // Prints 54321

Console.WriteLine(dict["My String Key"]); // Prints "My String"
Console.WriteLine(dict[(object)12345]);   // Prints 54321 (note the need to cast!)

This has fast add and remove, and O(1) indexing. It only works with object keys and values though - there's no generic version.

[EDIT] Many years later: We now have the strongly-typed generic SortedDictionary<TKey, TValue> which might be better.

Matthew Watson
  • 104,400
  • 10
  • 158
  • 276
7

The basic problem is the indexing.

In an array or a list, the data is indexed by its coördinate - usually just a simple int index. In a HashSet, you pick the index yourself - the key. The side-effect is, though, that there is no "coördinate" - the question "element at index 3" doesn't make sense, really. The way it's actually implemented is that the whole HashSet is enumerated, item after item, and the n-th item is returned. This means that to get the 1000th item, you have to enumerate all the 999 items before that as well. This hurts.

The best way to solve this would be to pick the random based on an actual key of the HashSet. Of course, this only works if it's reasonable to pick random keys just like that.

If you can't pick the key at random in a satisfactory way, you'll probably want to keep two separate lists - whenever you add a new item to a HashSet, add its key to a List<TKey>; you can then easily pick a random key from the List, and follow it. Depending on your requirements, duplicates may not be much of a problem.

And of course, you could save on the ElementAt enumerations if you only do the enumeration once - for example, before searching the HashSet, you could convert it to List. This only makes sense if you're picking multiple random indices at once, of course (e.g. if you pick 5 indices at random at once, you'll save about 1/5th of the time on average) - if you're always picking one, then modifying the HashSet and picking another, it's not going to help.

Depending on your exact use case, it might also be worth having a look at SortedSet. It works in a similar way to HashSet, but it maintains order in the keys. The helpful part is that you can use the GetViewBetween method to get a whole range of keys - you could use this quite effectively if your keys are sparse, but well balanced between arbitrary ranges. You'd just first pick a range at random, then get the items in range with GetViewBetween, and pick a random one out of those as well. In effect, this will allow you to partition the search results, and should save quite a bit of time.

Luaan
  • 62,244
  • 7
  • 97
  • 116
  • 1
    Yes, I'm thinking a list and a hashset to index it. – spender May 27 '15 at 12:55
  • @spender Yeah, that can work quite well if you don't care about removing the garbage. If you do, though, it can get quite expensive. – Luaan May 27 '15 at 12:57
  • The objects from which I want to pick a random one are Cells in a grid, so it should be easy enough to give them a unique ID (x-coordinates to string + y-coordinates to string ?) So would I need to override GetHashCode in the Cell class if I want to "pick the random based on an actual key of the HashSet" ? – Christian Geese May 27 '15 at 13:34
  • @ChristianGeese What key are you using now? The whole cell? What kind of type is that? – Luaan May 27 '15 at 13:54
  • Yeah, the whole cell. Not sure what "type" means. It's just a custom class holding a few pieces of information, no inheritances. – Christian Geese May 27 '15 at 16:49
  • @ChristianGeese It's just a class, with no overridden `GetHashCode` and `Equals`? So you're actually relying on having the same *instance* of the cell in the `HashSet`? That sounds like a bad idea. And actually, thinking about it more, why are you using an approach like that for a *floodfill*? Why not just use a simple two-dimensional array with a stack? – Luaan May 28 '15 at 06:44
  • Yes that's right. I create all instances of the Cell first and then I just modify them, so there should be no duplicates, no? The floodfill is not a regular one but a) random (filling random cells at the fringes of the "flood") and b) starts from multiple locations. (Trying to generate "tectonic plates"). Using a 2d array could suffice though. I am very unexperienced in programming. :( – Christian Geese May 28 '15 at 08:50
  • I changed it to use two 2D arrays, it's blazingly fast (~15ms for 10 plates on a 500x500 area). Thanks a bunch!! – Christian Geese May 28 '15 at 14:34