
I have a method that takes an upper limit and returns a list of prime numbers up to that limit.

    public static List<int> AllPrimesUnder(int upperLimit)

I later decided that I really just needed to do lookups on the list, often just asking the question "is this prime?". Since I was dealing with all primes under values like a million, I realized that HashSet was the structure I should be using. Certainly the lookup using the result of the method was faster, but the method itself was slower.

I believe the reason it's slower is that HashSet checks for duplicates before adding, while a List just shoves them on the end. What surprised me, and what spawned the question and title, is why starting with a List and using it to create a HashSet, like so:

    hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));

is faster than using a HashSet internal to the method, enabling a call like this:

    hashSet = Prime.AllPrimesUnder_Hash(1000000);

If the slowdown is in the duplicate checking, it should have to do the same amount of checking no matter what, right? This is likely where my understanding is failing me.

Here are the times I'm getting for primes under one million.

  • 0.1136s Pure Hash
  • 0.0975s Pure List (expected to be faster)
  • 0.0998s Pure List Converted to Hash (not expected)

If the reason for this can be explained in simple terms, I'd love to hear it. I suppose at a minimum what I'm looking for is enough of an understanding to know if I should start with a List or a HashSet if the end result will be a large HashSet of items.

I've added the body of the prime method below, but note that all interaction with the data structure is identical (code-wise) between the two. I don't believe how I add data to the structure should affect the anomaly.

    public static List<int> AllPrimesUnder(int upperLimit)
    {
        List<int> primeList = new List<int>();
        primeList.Add(2);
        int testNumber = 3;
        bool isPrime;

        while (testNumber <= upperLimit)
        {
            isPrime = true;

            foreach (int prime in primeList)
            {
                if (testNumber % prime == 0)
                {
                    isPrime = false;
                    break;
                }
                if (testNumber < prime*prime)
                    break;
            }

            if (isPrime)
                primeList.Add(testNumber);

            testNumber++;
        }

        return primeList;
    }

Edit: By request I am adding the code for the Hash method. If it looks nearly identical, that's because it is.

    public static HashSet<int> AllPrimesUnder_Hash(int upperLimit)
    {
        HashSet<int> primeHash = new HashSet<int>();
        primeHash.Add(2);
        int testNumber = 3;
        bool isPrime;

        while (testNumber <= upperLimit)
        {
            isPrime = true;

            foreach (int prime in primeHash)
            {
                if (testNumber % prime == 0)
                {
                    isPrime = false;
                    break;
                }
                if (testNumber < prime*prime)
                    break;
            }

            if (isPrime)
                primeHash.Add(testNumber);

            testNumber++;
        }

        return primeHash;
    }

Also by request is the (ugly, hackish) code I used to test the execution time:

        Stopwatch stopWatch = new Stopwatch();
        int iterations = 1;
        HashSet<int> hashSet = new HashSet<int>();
        List<int> list = new List<int>();

        stopWatch.Restart();
        for (int i = 0; i < iterations; i++)
        {
            hashSet = Prime.AllPrimesUnder_Hash(1000000);
        }
        stopWatch.Stop();

        Console.WriteLine("Hash: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));

//////////////////////////

        stopWatch.Restart();
        for (int i = 0; i < iterations; i++)
        {
            hashSet = new HashSet<int>(Prime.AllPrimesUnder(1000000));
        }
        stopWatch.Stop();


        Console.WriteLine("List converted: " + (stopWatch.Elapsed.TotalSeconds / iterations).ToString("#.###################"));
Earendil
    How did you account for testing errors (for example, maybe your CPU happened to be busy for 0.02 seconds during the pure hash run)? – mbeckish Sep 24 '13 at 17:10
  • What's your testing/benchmarking code like? Is it running in release mode? A difference of 2.3 _milliseconds_ is _very_ minor and subject to a whole _host_ of other outside factors (operating system usage, memory, other programs running, butterflies in Japan flapping their wings, etc.) – Chris Sinclair Sep 24 '13 at 17:11
  • Did you run your benchmark in release mode (not via debugger), did a warm-up, used a precise enough timer and a big enough number of iterations to compute the average? Before drawing conclusions, be sure your benchmark is appropriate. – Pierre-Luc Pineault Sep 24 '13 at 17:14
  • Another possibility. Did you call `.Add()` or simply `hash[n] = true;` –  Sep 24 '13 at 17:22
  • You may want to look in to some of the other algorithms for finding primes, for example the [Sieve of Eratosthenes](http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes) or the more modern [Sieve of Atkin](http://en.wikipedia.org/wiki/Sieve_of_Atkin) are very easy ways find primes that requires a lot fewer tests. – Scott Chamberlain Sep 24 '13 at 17:23
  • I was wondering if my testing methods would be questioned. I take it I'm not the only one who thinks the results were odd; that makes me feel better! Testing: release mode, no debugger, multiple iterations with average time, multiple runs with slight modifications to the code to see if I could make it faster, using System.Diagnostics.Stopwatch. – Earendil Sep 24 '13 at 17:27
  • Show us AllPrimesUnder_Hash. Also, how many items will the collections contain? – usr Sep 24 '13 at 17:37
  • @usr AllPrimesUnder_Hash is identical. Sorry if that wasn't clear. For one million primes you're looking at around 78,000-ish numbers. – Earendil Sep 24 '13 at 17:47
  • @Earendil just to make sure (because my answer depends on it): it is identical *except* that `List primeList = new List();` is now `HashSet primeList = new HashSet();`. Right? – usr Sep 24 '13 at 17:49
  • @usr: I read it as the last line of the method either is `return primeList;` and the `HashSet` constructor is called _outside_ the method, or the last line is `return new HashSet(primeList);`. Otherwise they're identical. I agree though, Earendil, I think you should post your `Prime.AllPrimesUnder_Hash` method because it's a bit ambiguous what you have going on here. (I might also suggest you post your benchmark code) – Chris Sinclair Sep 24 '13 at 17:53
  • @ChrisSinclair Code posted. Sorry about the confusion. Also see the single code lines near the top that state what the calling method looks like. – Earendil Sep 24 '13 at 19:13
  • related q that can help [what-does-a-hashset-do-with-memory-when-initializing-a-collection](http://stackoverflow.com/questions/11557056/what-does-a-hashset-do-with-memory-when-initializing-a-collection) – nawfal May 25 '14 at 11:09
  • I just got tripped out because in the [reference source](http://referencesource.microsoft.com/#System.Core/System/Collections/Generic/HashSet.cs,9fbbf8ff7dbc0962), a HashSet's size must be a prime! I was following up on the comment by @pipTheGeek. I found it funny that your example was testing if something `isPrime` and there is call to `IsPrime()` in HashSet's source! – Zach Mierzejewski Jul 01 '16 at 17:26

3 Answers


The reason is that when a HashSet is initialized with a collection, it can use the size of the collection to set its capacity. When adding values to an empty HashSet, the capacity needs to be increased from time to time, and each of those resizes is an O(n) operation.
For some reason HashSet does not take capacity as a parameter in its constructor the way List does. (Later versions of .NET added a `HashSet<T>(int capacity)` constructor.)
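A minimal sketch of the effect (the class and variable names are mine, and the printed timings will vary by machine and runtime): filling a HashSet one element at a time forces repeated grow-and-rehash steps, while the collection constructor can see the source's Count and size the table up front.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class CapacityDemo
{
    static void Main()
    {
        const int count = 1000000;
        var source = new List<int>(count);
        for (int i = 0; i < count; i++)
            source.Add(i);

        // Adding one element at a time: the HashSet starts small and must
        // repeatedly allocate a bigger table and rehash every existing item.
        var sw = Stopwatch.StartNew();
        var incremental = new HashSet<int>();
        foreach (int i in source)
            incremental.Add(i);
        sw.Stop();
        Console.WriteLine("Add loop:    {0} ms", sw.ElapsedMilliseconds);

        // Building from a collection: the constructor can use source.Count
        // to pick a capacity and avoid most of the intermediate resizes.
        sw.Restart();
        var fromCollection = new HashSet<int>(source);
        sw.Stop();
        Console.WriteLine("Constructor: {0} ms", sw.ElapsedMilliseconds);
    }
}
```

Both paths end up with the same set; only the number of intermediate allocations differs.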

Magnus
  • Right: and FWIW, the same extra work expanding the backing store was done at some point adding the items to the list as well. Always best to set the capacity early, if you can. – Joel Coehoorn Sep 24 '13 at 17:20
  • When I first read your answer it seemed plausible, but then I started thinking about it. When creating the pure List, it has to go through the same expansion mechanism, as I do not give it an initial size large enough to fit an arbitrary number of primes. Unless List expansion is significantly faster than HashSet expansion, I don't think that can account for it taking 1.14x longer. Do you? – Earendil Sep 24 '13 at 17:50
  • I read it as the `Prime.AllPrimesUnder_Hash` method does _exactly the same thing_ as `Prime.AllPrimesUnder`, but instead of returning a `List`, it returns `new HashSet(primeList);`. Essentially, the only difference tested was having `HashSet` instantiation in the method or outside the method. – Chris Sinclair Sep 24 '13 at 17:50
  • Growing a list is simply a new memory allocation, then a straightforward memory copy of the existing items. Growing a HashSet, on the other hand, requires a new memory allocation and then recomputing the hash of every item and adding it to the appropriate place. – pipTheGeek Sep 24 '13 at 18:19
  • @pipTheGeek that's worth making an answer. – nawfal May 25 '14 at 10:49

In AllPrimesUnder you are enumerating the prime list many times (once for each prime candidate). Enumerating a List is faster than enumerating a HashSet because the internal array of the HashSet is more sparse.

Not having seen the code for AllPrimesUnder_Hash, I am guessing that this is the main cause.

I'm not convinced that resizing a list of a few thousand items could consume 20ms. Copying memory using memcpy (which is what happens internally) is one of the highest-throughput operations you can do. You can copy tens of gigabytes per second per core.
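A rough way to isolate the enumeration cost (a sketch; the element count mirrors the roughly 78,000 primes under a million mentioned in the comments, but any large set will do): fill a List and a HashSet with the same values, then time repeated foreach passes over each.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

class EnumerationDemo
{
    static void Main()
    {
        const int count = 78498;   // the number of primes under 1,000,000
        const int passes = 1000;
        var list = new List<int>();
        var hash = new HashSet<int>();
        for (int i = 0; i < count; i++) { list.Add(i); hash.Add(i); }

        long listSum = 0, hashSum = 0;

        var sw = Stopwatch.StartNew();
        for (int p = 0; p < passes; p++)
            foreach (int v in list) listSum += v;
        sw.Stop();
        Console.WriteLine("List enumeration:    {0} ms", sw.ElapsedMilliseconds);

        // The HashSet enumerator must skip over empty slots in its backing
        // array, so each MoveNext() does more work than the List version.
        sw.Restart();
        for (int p = 0; p < passes; p++)
            foreach (int v in hash) hashSum += v;
        sw.Stop();
        Console.WriteLine("HashSet enumeration: {0} ms", sw.ElapsedMilliseconds);

        Console.WriteLine(listSum == hashSum); // same elements, so same sum
    }
}
```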

usr
  • I did not know that enumerating over a HashSet was slower than a List. I missed that somewhere along the way. The code for using a Hash and a List is identical for each method; that is why I didn't include it. The enumeration and .Add() are the only interaction the method has with the data structure. I think I'll try to pare down and test the enumeration difference between the two large sets. – Earendil Sep 24 '13 at 17:53
  • After some testing I believe this to be the answer in my case. I'm enumerating way more than I am expanding a list, though I have no doubt that HashSet expansion takes longer than List expansion due to rehashing, it doesn't account for as much time as the enumeration difference in my case. – Earendil Sep 26 '13 at 01:07

Looking at your algorithm, I suspect that the pure hash is slower because it is a hash, not an ordered list. When using an ordered list, you test divisibility against 2, 3, 5, 7, etc. in order, so the smaller divisors (which are more often divisors) are tested first. When using a hash, the order is arbitrary, so you may test for divisibility by 23 before you test for divisibility by 3.

BTW, you should use `testNumber += 2`, and exclude 2 from your list of primes, inserting it when you are done with your loop.

Even better, the Sieve of Eratosthenes is usually a faster way to calculate all primes for relatively small numbers. Or better still, precompute your low-value primes and load them from disk.
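For reference, a minimal Sieve of Eratosthenes sketch in C# (the method name is mine, not from the question); it marks composites once per prime multiple instead of trial-dividing every candidate:

```csharp
using System;
using System.Collections.Generic;

class SieveDemo
{
    // Returns all primes <= upperLimit by crossing out multiples.
    public static List<int> SievePrimesUnder(int upperLimit)
    {
        var isComposite = new bool[upperLimit + 1];
        var primes = new List<int>();
        for (int p = 2; p <= upperLimit; p++)
        {
            if (isComposite[p])
                continue;
            primes.Add(p);
            // Start at p*p: smaller multiples were already marked by smaller primes.
            for (long multiple = (long)p * p; multiple <= upperLimit; multiple += p)
                isComposite[(int)multiple] = true;
        }
        return primes;
    }

    static void Main()
    {
        Console.WriteLine(SievePrimesUnder(1000000).Count); // 78498 primes under one million
    }
}
```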

EDIT -- ADDED

Not what I expected initially (a hash being out of order), but it looks like there is just a good bit more overhead in MoveNext(), which is what foreach uses internally.

Compare the difference in the MoveNext() functions -- which you will call millions of times in the innermost loop.

// HashSet<>.MoveNext()
public bool MoveNext()
{
    if (this.version != this.set.m_version)
    {
        throw new InvalidOperationException(SR.GetString("InvalidOperation_EnumFailedVersion"));
    }
    while (this.index < this.set.m_lastIndex)
    {
        if (this.set.m_slots[this.index].hashCode >= 0)
        {
            this.current = this.set.m_slots[this.index].value;
            this.index++;
            return true;
        }
        this.index++;
    }
    this.index = this.set.m_lastIndex + 1;
    this.current = default(T);
    return false;
}


// List<>.MoveNext()
public bool MoveNext()
{
    List<T> list = this.list;
    if ((this.version == list._version) && (this.index < list._size))
    {
        this.current = list._items[this.index];
        this.index++;
        return true;
    }
    return this.MoveNextRare(); // this call should be rare as the name implies
}

private bool MoveNextRare()
{
    if (this.version != this.list._version)
    {
        ThrowHelper.ThrowInvalidOperationException(ExceptionResource.InvalidOperation_EnumFailedVersion);
    }
    this.index = this.list._size + 1;
    this.current = default(T);
    return false;
}
Gary Walker
  • I thought the lack of ordering would affect it too; however, based on the code it'll break off and stop testing when the primes get past the square root of the test number. If it weren't enumerating them in the order they were added, I wouldn't get the correct results, which I do. – Earendil Sep 24 '13 at 18:03
  • In the code you posted, you do not stop at the square root of the number being evaluated. Microsoft specifically cautions that HashSet<> is unordered. I was surprised at your answer, so I implemented my own code and, as each prime was added, I tested by using a foreach(prime) on the HashSet and verified that the HashSet was in fact in order. This was quite unexpected to me. – Gary Walker Sep 24 '13 at 18:41
  • Unless there is a defect in my code, the line that states `if (testNumber < prime*prime) break;` will indeed stop enumerating the list of primes and stop testing the number in question. Which is the same as saying `Math.Sqrt(testNumber) < prime`, but computationally faster. – Earendil Sep 24 '13 at 19:07