Why is case-insensitive HashSet performance so bad

Question

Why is HashSet<string>(StringComparer.InvariantCultureIgnoreCase) (Perf_HashSet_CaseInsensitive) perf'ing (performing) so bad? The Perf_HashSet workaround is perf'ing 20x comparatively.

// perf Contains(): 1M iterations, 25 size (unsuccessful lookup)
Test                         Duration (ms)
Perf_HashSet                            43 <-
Perf_Dictionary                         49 
Perf_HybridDictionary                   63 
Perf_ListDictionary                    223 
Perf_List                              225 
Perf_HashSet_CaseInsensitive           903 <-

code:

// <TargetFramework>net5.0</TargetFramework>
[TestFixture]
public class ContainsPerfTests
{
    private const int iterations = 1_000_000;
    private const int size = 25;

    [Test]
    [Explicit]
    public void Perf_List()
    {
        var list = new List<string>();
        for (int i = 0; i < size; i++)
        {
            list.Add(Guid.NewGuid().ToString().ToLowerInvariant());
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = list.Contains(x.ToLowerInvariant());
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    [Test]
    [Explicit]
    public void Perf_HashSet()
    {
        var hashSet = new HashSet<string>();
        for (int i = 0; i < size; i++)
        {
            hashSet.Add(Guid.NewGuid().ToString().ToLowerInvariant());
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = hashSet.Contains(x.ToLowerInvariant());
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    [Test]
    [Explicit]
    public void Perf_HashSet_CaseInsensitive()
    {
        var hashSetCaseInsensitive = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase);
        for (int i = 0; i < size; i++)
        {
            hashSetCaseInsensitive.Add(Guid.NewGuid().ToString().ToLowerInvariant());
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = hashSetCaseInsensitive.Contains(x);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    [Test]
    [Explicit]
    public void Perf_Dictionary()
    {
        var dictionary = new Dictionary<string, bool>();
        for (int i = 0; i < size; i++)
        {
            dictionary.Add(Guid.NewGuid().ToString().ToLowerInvariant(), false);
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = dictionary.ContainsKey(x.ToLowerInvariant());
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    [Test]
    [Explicit]
    public void Perf_HybridDictionary()
    {
        var hybridDictionary = new HybridDictionary(caseInsensitive: true);
        for (int i = 0; i < size; i++)
        {
            hybridDictionary.Add(Guid.NewGuid().ToString().ToLowerInvariant(), null);
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = hybridDictionary.Contains(x);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    [Test]
    [Explicit]
    public void Perf_ListDictionary()
    {
        var listDictionary = new ListDictionary();
        for (int i = 0; i < size; i++)
        {
            listDictionary.Add(Guid.NewGuid().ToString().ToLowerInvariant(), null);
        }
        var x = Guid.NewGuid().ToString();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            var contains = listDictionary.Contains(x.ToLowerInvariant());
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }
}

Shouldn't you use `OrdinalIgnoreCase` to make a fair test? `InvariantCultureIgnoreCase` performs a culture-aware comparison using the rules of the invariant culture. — Sweeper, Oct 01 '21 at 18:51
They don't behave the same, so their performance characteristics aren't relevant. — Servy, Oct 01 '21 at 18:53
Also you're not counting the memory allocations you're creating with your `ToLower...()` calls, you will have to pay for those later with GC collections. — Blindy, Oct 01 '21 at 18:54
It seems redundant to use an ignore case comparator and also insert elements using the toLower() — Charlie, Oct 01 '21 at 18:56
@Blindy You have to pay the collection costs *for the strings that can actually be collected*. If in the actual code in question the original strings will stay rooted then you're increasing the memory footprint, not [just] adding GC pressure. — Servy, Oct 01 '21 at 19:20
@Charlie inserting elements using toLower() is just to keep the setup the same for all. — hIpPy, Oct 01 '21 at 21:14

score 5 · Accepted Answer · answered Oct 01 '21 at 19:56

The main reason for this big difference in performance is because you are using a culture-sensitive comparison for the case-insensitive hash set, whereas the case-sensitive hash set uses an ordinal one.

Without passing the hash set any equality comparers, the hash set uses the default equality comparer of that type, which calls the Equals method. Notice that String.Equals

performs an ordinal (case-sensitive and culture-insensitive) comparison.

Whereas StringComparer.InvariantCultureIgnoreCase:

performs a case-insensitive string comparison using the word comparison rules of the invariant culture.

It is important to note that "invariant culture" is not the same as "culture-insensitive", or "ordinal":

Console.WriteLine(StringComparer.InvariantCultureIgnoreCase.Equals( "ss", "ß")); // true Console.WriteLine(StringComparer.OrdinalIgnoreCase.Equals( "ss", "ß")); // false

The invariant culture still has collation rules, and looking up and applying those rules takes time.

Changing InvariantCultureIgnoreCase to OrdinalIgnoreCase should make the execution time almost the same. That doesn't mean there aren't other problems with your benchmark, as the comments have pointed out.

see also https://stackoverflow.com/questions/492799/difference-between-invariantculture-and-ordinal-string-comparison — hIpPy, Oct 03 '21 at 19:22

Why is case-insensitive HashSet performance so bad

1 Answers1