5

I have this query:

var newComponents = from ic in importedComponents
                    where !existingComponents.Contains(ic)
                    select ic;

importedComponents and existingComponents are of type List<ImportedComponent>, and exist only in memory (are not tied to a data context). In this instance, importedComponents has just over 6,100 items, and existingComponents has 511 items.

This statement is taking too long to complete (I don't know how long, I stop the script after 20 minutes). I've tried the following with no improvement in execution speed:

var existingComponentIDs = from ec in existingComponents
                           select ec.ID;

var newComponents = from ic in importedComponents
                    where !existingComponentIDs.Contains(ic.ID)
                    select ic;

Any help will be much appreciated.

Albert Bori
  • 9,832
  • 10
  • 51
  • 78

3 Answers3

5

The problem is quadratic complexity of this algorithm. Put the IDs of all existingComponentIDs into a HashSet and use the HashSet.Contains method. It has O(1) lookup cost compared to O(N) for Contains/Any on a list.

The morelinq project contains a method that does all of that in one convenient step: ExceptBy.

usr
  • 168,620
  • 35
  • 240
  • 369
4

You could use Except to get the set difference:

var existingComponentIDs = existingComponents.Select(c => c.ID); 
var importedComponentIDs = importedComponents.Select(c => c.ID);
var newComponentIDs = importedComponentIDs.Except(existingComponentIDs);
var newComponents = from ic in importedComponents
        join newID in newComponentIDs on ic.ID equals newID
        select ic;
foreach (var c in newComponents)
{ 
    // insert into database?
}

Why is LINQ JOIN so much faster than linking with WHERE?

In short: Join method can set up a hash table to use as an index to quicky zip two tables together

Community
  • 1
  • 1
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • Could also be made into a reusable function ExceptBy. – usr Apr 21 '12 at 00:30
  • @usr: [`Except`](http://msdn.microsoft.com/en-us/library/bb300779.aspx) is already a reusable method which accepts any kind of type and can be used with the default or a [custom equality comparer](http://msdn.microsoft.com/en-us/library/bb336390.aspx), what more could you need? It's also nice that you can [benefit from it's deferred execution](http://stackoverflow.com/a/7324095/284240)(as i've done above). – Tim Schmelter Apr 21 '12 at 12:23
  • ExceptBy would work like MinBy vs. Min. It saves you the projection step and does everything you did in this code example in one step. – usr Apr 21 '12 at 13:20
  • @user: Ok, then you can use Jon Skeet's [morelinq](http://code.google.com/p/morelinq/source/browse/trunk/MoreLinq/ExceptBy.cs?r=173). – Tim Schmelter Apr 21 '12 at 13:45
  • As I am actually doing that I added this tip to my answer. Good suggestion! – usr Apr 21 '12 at 13:47
  • I tried `var newComponents = importedComponents.Except(existingComponents);` and it worked perfectly. @usr 's answer is also correct, but I'm marking this as the best answer because it was the simplest and most direct fix. His may still be more efficient, but I'm not educated enough to judge that. – Albert Bori Apr 21 '12 at 14:52
  • 1
    Please note that you are using reference comparison, then. This approach breaks down if two different objects can have the same id. I'd make sure this is not the case. – usr Apr 21 '12 at 14:56
  • @AlbertBori: I'd be interested in knowing how long this approach takes against your 20 minutes. This does not compare references but ID's which are 1. mostly identifiers and 2. presumably `ints`. – Tim Schmelter Apr 21 '12 at 15:20
  • 1
    @TimSchmelter The `Except()` query took 00:00:00.0030001. Quite an improvement :D. – Albert Bori Apr 21 '12 at 16:45
0

Well based on the logic and numbers you provided that means you are basically performing 3117100 comparisons when you run that statement. Obviously that is not entirely accurate because your condition may be satisfied before running through the entire array but you get my point.

With collections this large you are going to want use a collection where you can index your key (in this case your component ID) to help reduce the overhead of the search. The thing to remember is that even though LINQ looks like SQL there are no magic indexes here; it is mainly for convenience. In fact, I have seen articles where a link lookup is actually a slight bit slower than a brute force lookup.

EDIT: If it is possible I would suggest trying a Dictionary or SortedList for your values. I believe either one would have slightly better lookup performance.

Shawn Lehner
  • 1,293
  • 7
  • 14