24

Say a, b, c are all List<t> and I want to create an unsorted union of them. Although performance isn't super-critical, they might have 10,000 entries in each so I'm keen to avoid O(n^2) solutions.

AFAICT the MSDN documentation doesn't say anything about the performance characteristics of union as far as the different types are concerned.

My gut instinct says that if I just do a.Union(b).Union(c), this will take O(n^2) time, but new Hashset<t>(a).Union(b).Union(c) would be O(n).

Does anyone have any documentation or metrics to confirm or deny this assumption?

Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
Andy
  • 10,412
  • 13
  • 70
  • 95
  • 5
    `a.Union(b).Union(c)` uses hashsets internally (and uses more memory). – Caramiriel Jul 03 '17 at 10:30
  • a) 30.000 items are not that many, unless you are using it often. b) There are other factors here, like the implementation of `Equals` of `t`. c) Another factor: Unlike Java, C# `HashSet` seems not to have a method to resize its internal array beforehand, that can lead to more issues with `HashSet`. – SJuan76 Jul 03 '17 at 10:36
  • 4
    `Enumerable.Union` has `O(n+m+p)` – Tim Schmelter Jul 03 '17 at 10:36
  • 1
    `a.Union( b.Concat( c ) )` is around 20% faster than `a.Union( b ).Union( c )` – Sir Rufo Jul 03 '17 at 11:42

4 Answers4

29

You should use Enumerable.Union because it is as efficient as the HashSet approach. Complexity is O(n+m) because:

Enumerable.Union

When the object returned by this method is enumerated, Union<TSource> enumerates first and second in that order and yields each element that has not already been yielded.

Source-code here.


Ivan is right, there is an overhead if you use Enumerable.Union with multiple collections since a new set must be created for every chained call. So it might be more efficient(in terms of memory consumption) if you use one of these approaches:

  1. Concat + Distinct:

    a.Concat(b).Concat(c)...Concat(x).Distinct()
    
  2. Union + Concat

    a.Union(b.Concat(c)...Concat(x))
    
  3. HashSet<T> constructor that takes IEnumerable<T>(f.e. with int):

    new HashSet<int>(a.Concat(b).Concat(c)...Concat(x))
    

The difference between the first two might be negligible. The third approach is not using deferred execution, it creates a HashSet<> in memory. It's a good and efficient way 1. if you need this collection type or 2. if this is the final operation on the query. But if you need to to further operations on this chained query you should prefer either Concat + Distinct or Union + Concat.

Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • While looking implementation of [Union method](https://github.com/Microsoft/referencesource/blob/master/System.Core/System/Linq/Enumerable.cs), it creates new set by `Set set = new Set(comparer);` but I am not sure with `Set` concrete type in c#. I didn't see using of Hashset in Union method.so, I am confused with that. – Akash KC Jul 03 '17 at 14:07
  • @AkashKC: `Set` is an internal class, you find the source [here](http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,9c10b234c0932864,references). Why not `HashSet` was used? I don't know. Afaik both were implemented at the same time, so maybe the `HashSet` was not ready to use and they wanted to avoid future conflicts. However, both are using similar algorithms. **Edit** I just found this question: https://stackoverflow.com/questions/41954470/why-hashsett-class-is-not-used-to-implement-enumerable-distinct – Tim Schmelter Jul 03 '17 at 14:13
  • Thank you for the response. Now,my confusion has been cleared. Not having feature of navigation for the definition of type in github makes life, sometime difficult :( Anyways, your comment makes it clear now. Thanks !!! – Akash KC Jul 03 '17 at 14:38
6

While @Tim Schmelter is right about linear time complexity of the Enumerable.Union method, chaining multiple Union operators has the hidden overhead that every Union operator internally creates a hash set which basically duplicates the one from the previous operator (plus additional items), thus using much more memory compared to single HashSet approach.

If we take into account the fact that Union is simply a shortcut for Concat + Distinct, the scalable LINQ solution with the same time/space complexity of the HashSet will be:

a.Concat(b).Concat(c)...Concat(x).Distinct()
Ivan Stoev
  • 195,425
  • 15
  • 312
  • 343
3

Union is O(n).

a.Union(b).Union(c) is less efficient in most implementations than a.Union(b.Concat(c)) because it creates a hash-set for the first union operation and then another for the second, as other answers have said. Both of these also end up with a chain of IEnumerator<T> objects in use which increases cost as further sources are added.

a.Union(b).Union(c) is more efficient in .NET Core because the second .Union() operation produces a single object with knowledge of a, b and c and it will create a single hash-set for the entire operation, as well as avoiding the chain of IEnumerator<T> objects.

Jon Hanna
  • 110,372
  • 10
  • 146
  • 251
1

As with everything, this has been changing with .NET Core and now just .NET. With Framework, it was most efficient (by about 20%) to use a chain of Concat operations embedded within a Union:

a.Union(b.Concat(c).Concat(d)...Concat(x))

However, sometime in .NET Core, this changed and now a chained Union is now more efficient (by as much as 75%):

a.Union(b).Union(c).Union(d)...Union(x)

As with everything, it is important to monitor your critical sections and test.

David Beckman
  • 605
  • 8
  • 17