7

There must be an better way to do this, I'm sure...

// Simplified code
var a = new List<int>() { 1, 2, 3, 4, 5, 6 };
var b = new List<int>() { 2, 3, 5, 7, 11 };
var z = new List<int>();
for (int i = 0; i < a.Count; i++)
    if (b.Contains(a[i]))
        z.Add(a[i]);
// (z) contains all of the numbers that are in BOTH (a) and (b), i.e. { 2, 3, 5 }

I don't mind using the above technique, but I want something fast and efficient (I need to compare very large Lists<> multiple times), and this appears to be neither! Any thoughts?

Edit: As it makes a difference - I'm using .NET 4.0, the initial arrays are already sorted and don't contain duplicates.

Vitani
  • 1,594
  • 1
  • 14
  • 28
  • Take a look here: http://stackoverflow.com/questions/649444/testing-equality-of-arrays-in-c-sharp – Leo Feb 27 '12 at 12:47
  • Similar questions: [Efficient set intersection algorithm](http://stackoverflow.com/questions/497338/efficient-set-intersection-algorithm), [Fast intersection of sets: C++ vs C#](http://stackoverflow.com/questions/1060648/fast-intersection-of-sets-c-vs-c-sharp) – Devendra D. Chavan Feb 27 '12 at 12:48

5 Answers5

7

You could use IEnumerable.Intersect.

var z = a.Intersect(b);

which will probably be more efficient than your current solution.

note you left out one important piece of information - whether the lists happen to be ordered or not. If they are then a couple of nested loops that pass over each input array exactly once each may be faster - and a little more fun to write.

Edit In response to your comment on ordering:

first stab at looping - it will need a little tweaking on your behalf but works for your initial data.

    int j = 0;
    foreach (var i in a)
    {
        int x = b[j];
        while (x < i)
        {
            if (x == i)
            {
                z.Add(b[j]);
            }
            j++;
            x = b[j];
        }
    }

this is where you need to add some unit tests ;)

Edit final point - it may well be that Linq can use SortedList to perform this intersection very efficiently, if performance is a concern it is worth testing the various solutions. Dont forget to take the sorting into account if you load your data in an un-ordered manner.

One Final Edit because there has been some to and fro on this and people may be using the above without properly debugging it I am posting a later version here:

        int j = 0;
        int b1 = b[j];
        foreach (var a1 in a)
        {
            while (b1 <= a1)
            {
                if (b1 == a1)
                    z1.Add(b[j]);
                j++;
                if (j >= b.Count)
                    break;
                b1 = b[j];
            }
        }
dice
  • 2,820
  • 1
  • 23
  • 34
  • The initial arrays are ordered – Vitani Feb 27 '12 at 13:10
  • Thank-you for your update, and thanks to Stack over-flow for closing the question so I *can't* post my findings ... sigh. – Vitani Feb 27 '12 at 13:53
  • LIES AND SLANDER! Sorry dice, but I was not resetting the StopWatch before your code, yours is BY FAR the most efficient! It took 15 ticks after I reset the StopWatch. Awesome. – Vitani Feb 27 '12 at 14:16
  • Intersect() seems a bit weird - It takes about 2,750 ticks to "load", then no matter what dataset you throw at it on the second/third/subsequent calls it takes ~1 tick to complete. Your method takes longer on subsequent calls (compared to Intersect), but I'd have to do a LOT of array comparisons to make using Intersect the better choice. (Using my test data, I'd have to do over 250 comparisons - not likely in my scenario - for your method to become slower over-all) – Vitani Feb 27 '12 at 14:57
  • Additionally, I can halve the time your method takes by using an int[] instead of a List – Vitani Feb 27 '12 at 16:22
  • again interesting - note that my first implementation was seriously flawed - see the second one for improvements, also with regard to Intersect behavior see here: http://stackoverflow.com/questions/9467943/in-linq-why-are-subsequent-calls-of-ienumerable-intersect-so-much-faster – dice Feb 27 '12 at 16:33
  • Thanks for the link, I'll be watching that post to see what people say. As of now, it looks like .Intersect() doesn't actually do anything until you "query" *z*, so if we changed the code to include a .Count before the .Stop() we might get different results. I'll give that a go at work tomorrow, and post my results. – Vitani Feb 27 '12 at 19:27
  • Running my tests again, with a .Count() on each test before the .Stop(), your method is 4x faster than the fastest .Intersect() (17 ticks vs 70) – Vitani Feb 28 '12 at 09:21
2

There's IEnumerable.Intersect, but since this is an extension method, I doubt it will be very efficient.

If you want efficiency, take one list and turn it into a Set, then go over the second list and see which elements are in the set. Note that I preallocate z, just to make sure you don't suffer from any reallocations.

var set = new HashSet<int>(a);
var z = new List<int>(Math.Min(set.Count, b.Count));

foreach(int i in b)
{
    if(set.Contains(i))
        a.Add(i);
}

This is guaranteed to run in O(N+M) (N and M being the sizes of the two lists).

Now, you could use set.IntersectWith(b), and I believe it will be just as efficient, but I'm not 100% sure.

zmbq
  • 38,013
  • 14
  • 101
  • 171
  • I don't mean extension methods are inefficient by themselves, I just mean that it is only aware of the IEnumerable interface, making it hard to be efficient (unless they use a Set internally, do they?) – zmbq Feb 27 '12 at 12:56
1

The Intersect() method does just that. From MSDN:

Produces the set intersection of two sequences by using the default equality comparer to compare values.

So in your case:

var z = a.Intersect(b);
DaveShaw
  • 52,123
  • 16
  • 112
  • 141
1

If you can use LINQ, you could use the Enumerable.Intersect() extension method.

liwp
  • 6,746
  • 1
  • 27
  • 39
Omaha
  • 2,262
  • 15
  • 18
1

Use SortedSet<T> in System.Collections.Generic namespace:

SortedSet<int> a = new SortedSet<int>() { 1, 2, 3, 4, 5, 6 };
SortedSet<int> b = new SortedSet<int>() { 2, 3, 5, 7, 11 };
b.IntersectWith(s2);

But surely you have no duplicates!
Although your second list needs not to be a SortedSet. It can be any collection (IEnumerable<T>), but internally the method act in a way that if the second list also is SortedSet<T>, the operation is an O(n) operation.

Mohammad Dehghan
  • 17,853
  • 3
  • 55
  • 72