-1

In this topic Fast count() intersection of two string arrays, Some1.Kill.The.DJ explains a very fast method for calculating the intersection of two arrays of strings.

This is :

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();

In the same thing, how does Union?

Community
  • 1
  • 1
LeMoussel
  • 5,290
  • 12
  • 69
  • 122

1 Answers1

2

Honestly, that method shouldn't be any faster than the original, since it's essentially doing the same thing behind the scenes.

(Edit: It's not even correct in the case where arr1 contains duplicate items that are also in arr2! It's also not always faster: I tried it on short arrays and it was, but tried it on longer arrays and it wasn't.)

A parallel for Union might be

HashSet<string> arr2Set = new HashSet<string>(arr2);
int unionSize = arr2Set.Count + arr1.Where(x=>arr2Set.Add(x)).Count();

but it really should be no different from doing

int unionSize = arr1.Union(arr2).Count();

If you're worried about performance, read this first. By trying this version and timing the two, you could well end up with the original version being faster - I certainly did.

Rawling
  • 49,248
  • 7
  • 89
  • 127
  • "Edit: It's not even correct in the case where arr1 contains duplicate items that are also in arr2!" -- A `HashSet` cannot contain duplicate items, can it? –  Dec 18 '12 at 08:48
  • @hvd No, but the source arrays can. In such a case, `Intersect` (as a set operation) discards all duplicates, but the hashset-version discards duplicates in `arr2` (fine) but counts all duplicates in `arr1` (not fine). – Rawling Dec 18 '12 at 08:50
  • Ah, you're right, I misunderstood the question. Additionally, if duplicate items exist in `arr2`, they won't remain in `arr2Set`. –  Dec 18 '12 at 08:54
  • @hvd They won't remain, but that's OK - `Intersect` is supposed to ignore duplicates. – Rawling Dec 18 '12 at 08:57
  • I was concerned about the case you mentioned, duplicates in `arr1`. It is no longer possible to check whether they were also duplicated in `arr2`. I do not know about `Enumerable.Intersect()` specifically, but generally, I would expect the intersection of [ 1, 2, 2, 3 ] and [ 2, 2, 3, 3 ] to be [ 2, 2, 3 ], not [ 2, 3 ]. –  Dec 18 '12 at 09:01
  • @hvd The LINQ intersect/union/etc. are styled as "set operations", and they discard all duplicates; your kind of intersect is a little trickier to implement, but certainly possible. – Rawling Dec 18 '12 at 09:07