
I needed to assess the asymptotic time and space complexity of IEnumerable.Distinct in big O notation.

So I was looking at the implementation of the extension method Enumerable.Distinct, and I see it is implemented using an internal class Set<T>, which is almost a classical implementation of a hash table with "open addressing".

What quickly catches the eye is that a lot of code in Set<T> is just a copy-paste from HashSet<T>, with some omissions.

However, this simplified Set<T> implementation has some obvious flaws; for example, the Resize method does not use prime numbers for the size of the slots, as HashSet<T> does (see HashHelpers.ExpandPrime).
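
To illustrate the difference, here is a rough sketch of the two growth policies as I understand them; this is a paraphrase from memory, not the actual reference source, and since HashHelpers is internal its prime-based growth is only approximated here:

```
using System;

static class GrowthDemo
{
    // Set<T>-style growth: double and add one, so sizes stay odd
    // but are not necessarily prime (7, 15, 31, 63, ...).
    static int GrowSetStyle(int count)
    {
        return checked(count * 2 + 1);
    }

    // Rough stand-in for HashHelpers.ExpandPrime: a prime at least twice
    // the current size. The real code consults a precomputed prime table
    // and caps the size, so the exact values differ.
    static int GrowHashSetStyle(int count)
    {
        int candidate = checked(count * 2);
        while (!IsPrime(candidate)) candidate++;
        return candidate;
    }

    static bool IsPrime(int n)
    {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    static void Main()
    {
        // Start both from 7 (the initial size of Set<T>) just for comparison.
        int setSize = 7, hashSetSize = 7;
        for (int i = 0; i < 5; i++)
        {
            Console.WriteLine("Set-style: {0,5}   HashSet-style: {1,5}", setSize, hashSetSize);
            setSize = GrowSetStyle(setSize);             // 7, 15, 31, 63, 127
            hashSetSize = GrowHashSetStyle(hashSetSize); // 7, 17, 37, 79, 163
        }
    }
}
```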

So, my questions are:

  1. What is the reason for the code duplication here, why not stick with the DRY principle? Especially given that both of these classes are in the same assembly, System.Core.
  2. It looks like HashSet<T> will perform better, so should I avoid using the Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?
BoltClock
ironstone13
    I guess you'd get a much better answer from the actual developers instead of here. The only thing that's actually answerable on SO would be your second question and even there the advice will be »profile, and then treat the bottlenecks«. Also note that there have been a few changes over time, e.g. ISet didn't exist prior to .NET 4, so it might just be a history thing. – Joey Jan 31 '17 at 09:58
  • Why would this be a concern? Is this a hot path in your code where you're noticing a performance problem? Or just out of curiosity? – Yuval Itzchakov Jan 31 '17 at 10:01
  • I'm doing some tasks on Codility, and comparing my solutions to others, and, basically I need to have a very solid understanding of what will be the asymptotic performance of my code - and this is an observation that I made in the process. One of the basic things that all software development books say is that code duplication is a no-no, so I thought there must be a good reason for this – ironstone13 Jan 31 '17 at 10:05
  • Maybe interesting to note, Java uses powers of 2 for the size instead of prime numbers. The point is to avoid collisions. Who knows if using prime numbers is better, but are you going to notice the difference? – Dennis_E Jan 31 '17 at 10:33
  • The LINQ project had a very long lead time, the first MSDN forum threads about the alpha version started appearing in September 2005. That was before .NET 2.0 was released (Nov 2005). HashSet was released in 3.5. So a very simple explanation is that the programmer did not have HashSet available yet. – Hans Passant Jan 31 '17 at 10:35
  • @Dennis_E, when computing an index in the array that backs up the hashtable, we often use the hash code of the key mod size of the array. At first glance having even array sizes increases the probability of collisions, under the assumption of uniform key distribution. But I might be wrong – ironstone13 Jan 31 '17 at 10:42
  • @ironstone13 You are right that `Set` is not using prime sizes. But it uses odd (starts with 7, expands to 2 * size + 1) sizes, which is a close and cheap alternative. – Ivan Stoev Jan 31 '17 at 11:01
  • @ironstone13 `I need to have a very solid understanding of what will be the asymptotic performance of my code` If you're looking at *asymptotic complexity* then this is of no concern to you. Minor optimizations like this aren't affecting the asymptotic complexity of the code. – Servy Jan 31 '17 at 16:48
  • Keep in mind that `HashSet` cannot be used everywhere. See its `[System.Security.Permissions.HostProtection(MayLeakOnAbort = true)]` attribute. An example where it cannot be used is in an SQL Server CLR stored procedure. I *think* `Enumerable.Distinct` can be used there, but I'm not 100% sure. (I suspect this was not the original reason, which is why I'm not posting this as an answer.) –  Jan 31 '17 at 18:06
  • "Code duplication is bad" as a moral principle is just dogma; there's no fundamental moral problem. The real principle is "duplicating code can lead to increased costs because you then might have to implement the same feature or fix the same bug n times". But *deduplicating* code can lead to increased costs because more general code that tries to be all things to all people can be expensive to develop, fragile when modified, and hard to tune. Pick the technique that empirically lowers costs, rather than dogmatically following moral rules. – Eric Lippert Jan 31 '17 at 18:24
  • @EricLippert, I could not agree more, and of course I am not driven by moral dogma; DRY is just a short way of saying beware of maintenance costs. The cost of deduplication is much higher once the code gets to production, which is why in many cases it makes sense to adhere to DRY and not allow duplication in the first place, where it is reasonable and feasible. – ironstone13 Feb 01 '17 at 08:20

1 Answer


which is almost a classical implementation of a hash table with "open addressing"

Look again. It's separate chaining with list head cells. While the slots are all in an array, finding the next slot in the case of collision is done by examining the next field of the current slot. This has better cache efficiency than using linked lists with each node as a separate heap object, though not as good as open addressing in that regard. At the same time, it avoids some of the cases where open addressing does poorly.
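
To make that layout concrete, here's a stripped-down sketch of the scheme. It's my own simplification rather than the actual Set<T> source, but the shape is the same: the buckets array holds one-based list heads, and each slot stores the index of the next slot in its chain, so collision chains live inside the slots array instead of in separately allocated nodes.

```
using System;
using System.Collections.Generic;

// Simplified sketch of separate chaining with list head cells, in the
// style of Enumerable's internal Set<T>. Illustrative only, not the BCL code.
internal sealed class MiniSet<T>
{
    private struct Slot
    {
        internal int HashCode;  // cached (non-negative) hash of Value
        internal int Next;      // index of the next slot in this chain, -1 = end
        internal T Value;
    }

    private int[] _buckets = new int[7];  // holds (slot index + 1); 0 means empty
    private Slot[] _slots = new Slot[7];
    private int _count;
    private readonly IEqualityComparer<T> _comparer;

    public MiniSet(IEqualityComparer<T> comparer = null)
    {
        _comparer = comparer ?? EqualityComparer<T>.Default;
    }

    // Returns true if the value was added, false if it was already present.
    public bool Add(T value)
    {
        int hashCode = value == null ? 0 : _comparer.GetHashCode(value) & int.MaxValue;
        int bucket = hashCode % _buckets.Length;

        // Walk the chain embedded in the slots array.
        for (int i = _buckets[bucket] - 1; i >= 0; i = _slots[i].Next)
        {
            if (_slots[i].HashCode == hashCode && _comparer.Equals(_slots[i].Value, value))
                return false;
        }

        if (_count == _slots.Length)
        {
            Resize();
            bucket = hashCode % _buckets.Length;
        }

        int index = _count++;
        _slots[index] = new Slot
        {
            HashCode = hashCode,
            Value = value,
            Next = _buckets[bucket] - 1   // the new slot becomes the chain head
        };
        _buckets[bucket] = index + 1;
        return true;
    }

    private void Resize()
    {
        int newSize = checked(_count * 2 + 1);  // odd, but not necessarily prime
        var newBuckets = new int[newSize];
        var newSlots = new Slot[newSize];
        Array.Copy(_slots, newSlots, _count);

        // Rebuild every chain against the new bucket count.
        for (int i = 0; i < _count; i++)
        {
            int bucket = newSlots[i].HashCode % newSize;
            newSlots[i].Next = newBuckets[bucket] - 1;
            newBuckets[bucket] = i + 1;
        }
        _buckets = newBuckets;
        _slots = newSlots;
    }
}
```

Two flat arrays and int links is the whole structure, which is where the cache friendliness comes from.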

a lot of code in Set is just a copy-paste from HashSet, with some omissions

AFAICT the reason a private implementation of a hash-set was used is that Enumerable and HashSet were developed independently at about the same time. That's just conjecture on my part, but they were both introduced with .NET 3.5 so it's feasible.

It's quite possible that HashSet<T> started as a copy of Set<T> that was then made to better serve being exposed publicly, though it's also possible that the two were both based on the same principle of separate chaining with list head cells.

In terms of performance, HashSet's use of prime numbers means it's more likely to avoid collisions with poor hashes (but just how much of an advantage that is, is not a simple question), but Set is lighter in a lot of ways, especially in .NET Core where some things it doesn't need were removed. In particular, that version of Set takes advantage of the fact that once an item is removed (which happens, for example, during Intersect) there will never be an item added, which allows it to leave out the freelist and any work related to it, which HashSet couldn't do. Even the initial implementation is lighter in not tracking a version to catch changes during enumeration, which is a small cost, but a cost to every addition and removal nevertheless.
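
To illustrate the freelist point with the MiniSet sketch from above (again illustrative, not the actual source): under the "never add after removing" assumption, a Remove only has to unlink the slot from its chain and mark it dead, and because no slot index is ever handed out again there is no freelist to maintain.

```
// A Remove that could be added to the MiniSet<T> sketch above.
// The slot is unlinked and marked dead, but its index is never reused,
// so no freelist bookkeeping is needed.
public bool Remove(T value)
{
    int hashCode = value == null ? 0 : _comparer.GetHashCode(value) & int.MaxValue;
    int bucket = hashCode % _buckets.Length;
    int last = -1;

    for (int i = _buckets[bucket] - 1; i >= 0; last = i, i = _slots[i].Next)
    {
        if (_slots[i].HashCode == hashCode && _comparer.Equals(_slots[i].Value, value))
        {
            if (last < 0)
                _buckets[bucket] = _slots[i].Next + 1;  // removed the chain head
            else
                _slots[last].Next = _slots[i].Next;     // bypass the removed slot

            _slots[i].HashCode = -1;                    // mark the slot dead
            _slots[i].Value = default(T);
            _slots[i].Next = -1;
            return true;
        }
    }
    return false;
}
```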

As such, with different sets of data with different distributions of hash codes sometimes one performs better, sometimes the other.

Especially given the fact that both of these classes are in the same assembly System.Core

Only in some versions of .NET, in some they're in separate assemblies. In .NET Core we had two versions of Set<T>, one in the assembly that has System.Linq and one in the separate assembly that has System.Linq.Expressions. The former got trimmed down as described above, the latter replaced with a use of HashSet<T> as it was doing less there.

Of course System.Core came first, but the fact that those elements could be separated out at all speaks of System.Core not being a single monolithic blob of inter-dependencies.

That there is now a ToHashSet() method in .NET Core's version of Linq makes the possibility of replacing Set<T> with HashSet<T> more justifiable, though not a no-brainer. I think @james-ko was considering testing the benefits of doing that.

It looks like HashSet<T> will perform better

For the reasons explained above, that might not be the case, though it might indeed, depending on source data. That's before getting into considerations of optimisations that go across a few different linq methods (not many in the initial versions of linq, but a good few in .NET Core).

so should I avoid using the Distinct extension method, and write my own extension method that would use HashSet<T> instead of Set<T>?

Use Distinct(). If you've a bottleneck then it might be that HashSet<T> will win with a given data-set, but if you do try that, make sure your profiling closely matches the real values your code will encounter in real life. There's no point deciding one approach is the faster based on some arbitrary tests if your application hits cases where the other does better. (And if I was finding this a problem spot, I'd take a look at whether the GetHashCode() of the types in question could be improved for either speed or distribution of bits, first).
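
If you do want to measure the alternative the question describes, a minimal sketch is below; the name DistinctViaHashSet is mine, not a framework API, and the point is only to have something to profile against Distinct() on your real data and comparers.

```
using System;
using System.Collections.Generic;

public static class DistinctExtensions
{
    // Hypothetical HashSet<T>-backed alternative to Enumerable.Distinct,
    // intended only for profiling against the built-in method.
    public static IEnumerable<TSource> DistinctViaHashSet<TSource>(
        this IEnumerable<TSource> source,
        IEqualityComparer<TSource> comparer = null)
    {
        if (source == null) throw new ArgumentNullException("source");
        return DistinctViaHashSetIterator(source, comparer);
    }

    // Separate iterator method so the argument check above runs eagerly,
    // while enumeration stays lazy like the built-in Distinct.
    private static IEnumerable<TSource> DistinctViaHashSetIterator<TSource>(
        IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
    {
        var seen = new HashSet<TSource>(comparer);  // null comparer = default comparer
        foreach (var item in source)
        {
            if (seen.Add(item))  // HashSet<T>.Add returns false for duplicates
                yield return item;
        }
    }
}
```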

Jon Hanna