For a list that must only have unique items in it, but will contain 0 or 1 items 99% of the time, does the overhead make a List better than a HashSet?

Question

I am currently working on a project where we have a set of events. One piece of analysis we do on the events is to look through a specific type of event and check to see if it's likely that it was prompted by another event which happened shortly before (or slightly after in one odd case). Each of these events can only be effected by a single event, but one event could be the causal event for multiple events. We want this association to go both ways so that, from any particular method, you can go straight to the event which caused it, or one of the events which it caused. Based on that, I started by adding the following properties to the Event objects and adding a funct:

protected Event causalEvent;
protected List<Event> effectedEvents;

After a bit of thinking, I considered that we never want the same item added twice to the effectedEvents list. After reading the answer to Preventing Duplicate List<T> Entries, I went with a Hashset.

protected Event causalEvent;
protected HashSet<Event> effectedEvents;

A co-worker and I got to discussing the code I'd added and he pointed out that using a HashSet might confuse people since he tended to see a HashSet and assume that there's a great deal of data. In our case, due to the rules being used in the algorithms, effectedEvents is going to have 0 items in about 90% of the cases, 1 item in 9%, and 2 maybe 1% of the time. Almost never will we have more than 2 items, although it is possible. I believe the lookup cost is the same for both collections. The amount of memory used is very similar since both start assuming a small capacity (although, I will concede that List gives you the ability to set that capacity in the constructor while HashSet only allows one to trim the value down based on its contents, "rounded to an implementation-specific value").

So, long question short, is there any real penalty to using a HashSet other than possible confusion for those unfamiliar with using it to ensure uniqueness?

Have you checked http://stackoverflow.com/questions/150750/hashset-vs-list-performance ? — Marco, Jun 18 '14 at 20:02
I bet in the time it took you to type up this question, you could have written an empirical test to determine which is faster on your machine. My money's on the List being faster. About the lookup cost being the same, I'd guess the list is faster because it doesn't need to generate a hashcode. — adv12, Jun 18 '14 at 20:02
Check http://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs — Faisal, Jun 18 '14 at 20:04
If the size of the set is that small, then performance almost certainly shouldn't matter, because it's going to be way faster than it needs to be anyway — Servy, Jun 18 '14 at 20:06
In my experience, I've seen a HashSet used quite often for ensuring uniqueness. I would use whatever makes the most sense to your team and forget the few milliseconds you will save doing one method or the other; however, since you bothered to write up this question, it seems it matters. In that case, use a List because the initialization is quicker and your lookups will be very fast if the list never gets bigger than two items. — David Sherret, Jun 18 '14 at 20:07

score 5 · Accepted Answer · edited May 23 '17 at 12:27

The analysis performed in this answer indicates that you only see a performance advantage with HashSet over List when you get to 5 strings, or 20 objects (of course, results will vary based on what you are doing). Since you are going to have 0-2 items in almost all cases, your best bet performance-wise is probably to use the List.

I would not worry about the confusion of those unfamiliar with using a HashSet to ensure uniqueness. That is one of the primary uses of a HashSet. Pick the best tool for the job, and if you think people will be confused, a short comment can help with that.

Also, though it is good to use the best performing coding strategy, you should also beware of spending too much time on micro-optimizations that can be premature. Unless you are using a lot of these objects, you probably will never notice the difference between List and HashSet in this case.

That makes sense. Honestly, the `HashSet` makes more sense to me, but since the other guy is one of the guys most often tapped to modify code, it makes more sense, in the end, to cede to his wishes. I'll admit that I wrote the question in part because I hoped rubber-duck debugging would give me the answer by the time I finished, but it just left me with more questions. — Sean Duggan, Jun 18 '14 at 20:16

score 2 · Answer 2 · answered Jun 18 '14 at 20:25

If you are after memory and performance you could use a plain object and put the event directly into the field. If you need more than one entry you can replace it on demand with a List or HashSet. Below is some code to show the concept. This gives you maximum speed with a much reduced memory footprint if most of the time the List/HashSet is empty.

This is in my opinion the most elegant solution for such sparse data structures.

using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace DynamicSet
{
    class Program
    {
        object DynamicSet; // can be null, one stored Event or a List/HashSet<Event> 
                           // depending on how many elements are needed.

        bool Contains(Event ev)
        {
            if( DynamicSet == null )
            {
                return false;
            }

            var storedEvent = DynamicSet as Event;
            if (storedEvent != null)
            {
                return Object.ReferenceEquals(ev, storedEvent);
            }
            var set = (HashSet<Event>)DynamicSet;
            return set.Contains(ev);
        }

        void AddEvent(Event ev)
        {
            if( DynamicSet == null )
            {
                DynamicSet = ev;
                return;
            }

            var hash = DynamicSet as HashSet<Event>;
            if( hash != null )
            {
                hash.Add(ev);
            }
            else
            {
                hash = new HashSet<Event>();
                hash.Add((Event)DynamicSet);
                DynamicSet = hash;
            }
        }

        static void Main(string[] args)
        {
            Program p = new Program();
            Event ev1 = new Event();
            Event ev2 = new Event();

            p.AddEvent(ev1);
            Debug.Assert(p.Contains(ev1));
            Debug.Assert(!p.Contains(ev2));
            p.AddEvent(ev1);
            Debug.Assert(p.Contains(ev1));
            Debug.Assert(!p.Contains(ev2));

            p.AddEvent(ev2);
            Debug.Assert(p.Contains(ev1));
            Debug.Assert(p.Contains(ev2));

        }
    }
}

For a list that must only have unique items in it, but will contain 0 or 1 items 99% of the time, does the overhead make a List better than a HashSet?

2 Answers2