1

Let's say I have two tables in my SQL database. 1. A medium sized table with thousands of records called MyTable1 2. A large table with millions of records (and growing by the day) called MyTable2

MyTable1 and MyTable2 both have a property called Hash that can be equal.

I'm looking to find the most efficient way to use Linq to Entities to iterate over MyTable1 and find all records in MyTable2 that have the same Hash and save into another table. Here's a simplified view of what the code looks like.

using(var db = new context()) {
    var myTable1Records = db.MyTable1.Select(x => x);

    foreach(var record in myTable1Records) {  
        var matches = db.MyTable2.Where(y => y.Hash.Equals(record.Hash)).Select(y => y);
        foreach(var match in matches) {
           // Add match to another table
        }
    }
}

I'm seeing the performance of this code slow down significantly as the size of MyTable2 grows larger every day. A few ideas I'm experimenting with for efficiently handling this type of scenario are:

  1. Setting MergeOption.NoTracking on db.MyTable2 since it's purely a read operation. Haven't seen much of an improvement from this unfortunately.
  2. Pulling MyTable2 into memory using .ToList() to eliminate multiple calls to the db
  3. Creating "chunks" of MyTable2 that the code can iterate over so it's not querying against the full million+ records each time.

I'd love to see if there are other techniques or magic bullets you've found to be effective in this type of scenario. Thanks!

djpark
  • 181
  • 1
  • 10
  • 3
    Do you have an index on your hash colums in the db? – MoZahid Jan 10 '14 at 20:26
  • do you have to bring millions of values to c# to do something that the RDMS can do very efficiently? – G. Stoynev Jan 10 '14 at 20:38
  • 1
    This might be a bit off topic to using Linq, but what about creating a view in your DB with these 2 tables joined based upon the `Hash` field and then just query the joined table? - Let your DB do the work for you on this part... It's what it's been optimized to do! – John Bustos Jan 10 '14 at 20:38

4 Answers4

1

You have a property called Hash. Use it as a hash! Store the first table in a Dictionary keyed by Hash and then iterate through the second table checking for matches in the Dictionary, again keying by Hash.

Or, better yet, use LINQ:

var matches = db.MyTable1.Intersect(db.MyTable2);

If you need to do a custom comparison, create an IEqualityComparer. (I assume you're doing some type of projection and that the Select(x => x) is a placeholder for the purposes of this question.)

Or, better still, this operation might be better off taking place entirely in the database in a stored procedure or view. You're essentially doing a JOIN but using C# to do it. You're incurring the cost of the round trip time from database to your client application for what could possibly all be done on the database server.

MgSam
  • 12,139
  • 19
  • 64
  • 95
1

What you're doing here is performing an inner join. By using a query provider you can even ensure that this work is done on the DB side, rather than in memory within your application; you'll only be pulling down the matching results, no more:

var query = from first in db.MyTable1
    join second in db.MyTable2
    on first.Hash equals second.Hash
    select second;
Servy
  • 202,030
  • 26
  • 332
  • 449
1

I would recommend staying in SQL Server. A view or a clustered index might be the best approach.

Here are a few sources to use to read up on the subject of indexes:

  1. http://www.c-sharpcorner.com/uploadfile/nipuntomar/clustered-index-and-non-clustered-index-in-sql-server/
  2. http://technet.microsoft.com/en-us/library/jj835095.aspx
  3. Should every User Table have a Clustered Index?

And here is a source on SQL Views:

  1. http://technet.microsoft.com/en-us/library/aa214068(v=sql.80).aspx
Community
  • 1
  • 1
Andrew Quaschnick
  • 686
  • 13
  • 26
1

May be indexing your hash column can help. Assuming Hash is a char or varchar type, max length an index can support is 900 bytes.

CREATE NONCLUSTERED INDEX IX_MyTable2_Hash ON dbo.MyTable2(Hash);

For performance of indexing a varchar, you might want to check here SQL indexing on varchar

Community
  • 1
  • 1
ACS
  • 443
  • 4
  • 15