41

I am in need of a data type that is able to insert entries and then be able to quickly determine if an entry has already been inserted. A Dictionary seems to suit this need (see example). However, I have no use for the dictionary's values. Should I still use a dictionary or is there another better suited data type?

public class Foo
{
    private Dictionary<string, bool> Entities;

    ...

    public void AddEntity(string bar)
    {
        if (!Entities.ContainsKey(bar))
        {
            // bool value true here has no use and is just a placeholder
            Entities.Add(bar, true);
        }
    }

    public string[] GetEntities()
    {
        return Entities.Keys.ToArray();
    }

}
reformed
  • 4,505
  • 11
  • 62
  • 88
  • 1
    An interesting side note on the subject: Java's [`HashSet`](http://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html) is actually just a wrapper around `HashMap`. They do exactly what you suggest, but they put a prettier face on it. =) – jpmc26 Mar 04 '17 at 08:54
  • 1
    In http://stackoverflow.com/questions/2728500/hashsett-versus-dictionaryk-v-w-r-t-searching-time-to-find-if-an-item-exist statistical compare showing as a chart – Mohammad Akbari Mar 05 '17 at 03:37

2 Answers2

85

You can use HashSet<T>.

The HashSet<T> class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.

Habib
  • 219,104
  • 29
  • 407
  • 436
  • 15
    Note that unlike dictionary there is no need to check for existance before adding. – Evk Mar 03 '17 at 17:17
  • 1
    @Evk - you don't need to check for existence before adding to a Dictionary either if you use the indexer. – hoodaticus Mar 03 '17 at 17:32
  • 22
    Yes but in the event that the information is useful for other reasons, `HashSet.Add()` returns a `bool` indicating if the item was successfully added (not there before) – pinkfloydx33 Mar 03 '17 at 17:56
  • 6
    @pinkfloydx33 Did not know that! Just managed to remove about 30 lines of code from our system as we'd written a helper method that did just that! Thanks – Matthew Steeples Mar 03 '17 at 18:53
  • 5
    @MatthewSteeples in that case I feel I should mention that `HashSet.Remove` works the same way. It returns a `bool`; `true` if the item was found and removed, `false` otherwise. No exception thrown, no need to call `Contains`. – pinkfloydx33 Mar 03 '17 at 21:14
  • 1
    @hoodaticus: Yes that's true but using the index to add items to a dictionary will overwrite existing items with the same key. Unless this is the desired way you'd like your code to run, it can lead to unexpected results when querying items therein. – Alex Essilfie Mar 07 '17 at 22:13
3

Habib's answer is excellent, but for multi-threaded environments if you use a HashSet<T> then by consequence you have to use locks to protect access to it. I find myself more prone to creating deadlocks with lock statements. Also, locks yield a worse speedup per Amdahl's law because adding a lock statement reduces the percentage of your code that is actually parallel.

For those reasons, a ConcurrentDictionary<T,object> fits the bill in multi-threaded environments. If you end up using one, then wrap it like you did in your question. Just new up objects to toss in as values as needed, since the values won't be important. You can verify that there are no lock statements in its source code.

If you didn't need mutability of the collection then this would be moot. But your question implies that you do need it, since you have an AddEntity method.

Additional info 2017-05-19 - actually, ConcurrentDictionary does use locks internally, although not lock statements per se--it uses Monitor.Enter (check out the TryAddInternal method). However, it seems to lock on individual buckets within the dictionary, which means there will be less contention than putting the entire thing in a lock statement.

So all in all, ConcurrentDictionary is often better for multithreaded environments.

It's actually quite difficult (impossible?) to make a concurrent hash set using only the Interlocked methods. I tried on my own and kept running into the problem of needing to alter two things at the same time--something that only locking can do in general. One workaround I found was to use singly-linked lists for the hash buckets and intentionally create cycles in a list when one thread needed to operate on a node without interference from other threads; this would cause other threads to get caught spinning around in the same spot until that thread was done with its node and undid the cycle. Sure, it technically didn't use locks, but it did not scale well.

Community
  • 1
  • 1
Matt Thomas
  • 5,279
  • 4
  • 27
  • 59