Fast count() intersection of two string arrays

Question

I need to count the number of elements corresponding to the intersection of two big arrays of strings and do it very fast.

I am using the following code:

arr1[i].Intersect(arr2[j]).Count()

For CPU Time, VS Profiler indicates

85.1% in System.Linq.Enumerable.Count()
0.3% in System.Linq.Enumerable.Intersect()

Unfortunately it might to take hours to do all work.

How to do it faster?

The numbers you got from the profiler are propably not "correct". Because the Intersect is not executed when you say .Intersect(), the whole query is exeuted when you say .Count(). That's the nature of LINQ. I suspect there is more work to be done when intersecting than when counting. If you really need performance on this one, try to do it without LINQ. — cpt. jazz, Dec 16 '12 at 10:52
If it's big enough, put it in a database, or create a cluster of computers/threads, maybe do some MapReduce.. — Kieren Johnstone, Dec 16 '12 at 10:56
Are you intersecting the strings in the `arr1` and `arr2` or each character from each string in `arr1` vs. each character from each string in `arr2`? — Alex Filipovici, Dec 16 '12 at 11:03
I hardly think this is your bottleneck. In my tests Intersect+Count on 2 arrays of 5 millions of strings (avg.length=60 char) takes ~3.5 seconds... How big are your arrays ? — digEmAll, Dec 16 '12 at 12:11

Anirudha · Accepted Answer · 2012-12-16T11:16:18.073

4

You can use HashSet with arr2

HashSet<string> arr2Set = new HashSet<string>(arr2);
arr1.Where(x=>arr2Set.Contains(x)).Count();
              ------------------
                      |
                      |->HashSet's contains method executes quickly using hash-based lookup..

Not considering the conversion from arr2 to arr2Set ,this should be O(n)

edited Dec 16 '12 at 11:16

answered Dec 16 '12 at 11:10

Anirudha

32,393
7
68
89

This is the Best Way ! the most efficient. This method is for Instersect. Is it possible to use the same principle for Union? – LeMoussel Dec 17 '12 at 06:30
This gives incorrect answers in the case where `arr1` contains duplicate strings which also have duplicates in `arr2`. You'd need `Where(x => arr2Set.Remove(x))`. – Rawling Dec 18 '12 at 08:49
Also, this `Contains` version ends up _slower_ if the arrays get larger - but the version with `Remove` _does_ seem to stay faster. – Rawling Dec 18 '12 at 08:57
2

Just a syntax improvement - `arr1.Where(x=>arr2Set.Contains(x)).Count();` can be replaced with `arr1.Count(arr2Set.Contains);` – David S. Aug 18 '15 at 09:09

score 1 · Answer 2 · answered Dec 16 '12 at 10:53

I suspect the reason why the profiler shows the time being consumed in Count, is that this is where the collection is actually enumerated (the Intersect is lazily evaluated and does not run before you need the result).

I believe Intersect should have some internal optimizations to make this reasonably fast, but you could try using a HashSet<string> so you are sure the intersect can be made without searching through the inner array for each element:

HashSet<string> set = new HashSet<string>(arr1);
set.IntersectWith(arr2);
int count = set.Count;

Bizarrely, in my test this ended up slower than both the original and my corrected version of `Some1`'s answer, when I would've expected it to be great. — Rawling, Dec 18 '12 at 09:01

Nahum · Answer 3 · 2012-12-16T11:28:58.837

Hmmm Intersect is probably N^2

to make it faster quicksort both arrays. and than traverse both arrays. counting intersections.

too lazy to test how fast it would be but should O(nlogn +n)

    using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Test
{
    class Program
    {
        static void Main(string[] args)
        {
            const int arrsize = 1000000;
            Random rnd = new Random(42);
            string[] arr1 = new string[arrsize];
            string[] arr2 = new string[arrsize];
            for (int i = 0; i < arrsize; i++)
            {
                arr1[i] = rnd.Next().ToString();
                arr2[i] = rnd.Next().ToString();
            }
            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                arr1.Intersect(arr2).Count();
                Console.WriteLine("array" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

        {

            HashSet<string> set = new HashSet<string>(arr1);
            var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
            set.IntersectWith(arr2);
            int count = set.Count;
            Console.WriteLine("HashSet" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
        }
            {
               var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                HashSet<string> set = new HashSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("HashSet + new" + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                SortedSet<string> set = new SortedSet<string>(arr1);
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet +new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }

            {

                SortedSet<string> set = new SortedSet<string>(arr1);
                var stamp = (System.Diagnostics.Stopwatch.GetTimestamp());
                set.IntersectWith(arr2);
                int count = set.Count;
                Console.WriteLine("SortedSet without new " + (System.Diagnostics.Stopwatch.GetTimestamp() - stamp));
            }
        }
    }
}

results

array 914,637

HashSet 816,119

HashSet +new 1,150,978

SortedSet +new 16,173,836

SortedSet without new 7,946,709

so seems that best way is to keep a ready hash set.

Intersect in HashSet is also implemented O(n log n), just looked in the disassembly — user287107, Dec 16 '12 at 11:10
@user287107: that's not true. Both Intersect and HashSet use hash tables and so hashset is asymptotically `O(1)` and intersect is `O(n)`. Of course sometimes `O(logn)` complexity could be faster (it depends on gethashcode implementation, buckets and other factor), but they're not implemented using binarytrees or something with `O(logn)` complexity. — digEmAll, Dec 16 '12 at 11:14
@user287107: http://c-sharp-snippets.blogspot.it/2010/03/runtime-complexity-of-net-generic.html — digEmAll, Dec 16 '12 at 11:18
I know it could be faster, but just take a look at the implementation of the IntersectIterator class. the iterator does the following 1) create new set, 2) iterate through source1, add items to set 3) iterate through source2, if remove is successfull, yield return the element. the adding and removing is O(log n), and by the loop it is O(n log n) — user287107, Dec 16 '12 at 11:19
@user287107: `Linq.Enumerable.Set` add and remove are `O(1)` operations (as for the `HashSet<>`), and by the loop is `O(n * 1) --> O(n)` — digEmAll, Dec 16 '12 at 11:38
ok you're right, add and remove are **typical** O(1) operations, but that also implies that in the result it is not guaranteed O(n) — user287107, Dec 16 '12 at 11:47
@user287107: yes, that's correct. "Big O" complexity means asymptotic behaviour, so as I said before methods with `O(logn)` can be faster sometimes. For example in case of many hash collisions... — digEmAll, Dec 16 '12 at 11:56

user287107 · Answer 4 · 2012-12-16T11:11:40.623

when you are working with sets, your complexity will be O((n log n)*(m log m)) or so,

i think this here should be faster, but i'm not sure if it is now O((n log n)+(m log m))

possible would be 
var Set1 = arr1[i].Distinct().ToArray(); // if necessary, if arr1 or arr2 could be not distinct
var Set2 = arr2[j].Distinct().ToArray();  

nCount = Set1.Count() + Set2.Count() - Set1.Append(Set2).Distinct().Count();

score 0 · Answer 5 · answered Dec 16 '12 at 11:33

0

Build a HashSet using the smaller array and then loop through the bigger one, incrementing a counter if the item it exists in the hashset.

answered Dec 16 '12 at 11:33

Eren Ersönmez

38,383
7
71
92

Fast count() intersection of two string arrays

5 Answers5

Linked