I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. At predefined intervals I get handed a list of around 2,000 potentially new records, and I have to insert any of them that the database does not already have a matching record for.
I initially tried making the comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research led me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying it. The .Any() and .Contains() methods both seem to take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
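For reference, the objects in both lists look roughly like this; the field names below are placeholders rather than the real column names:

public class Data
{
    // Every field is a string because the underlying table stores
    // everything as nvarchar, including the dates.
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string Field4 { get; set; }
}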
- When comparing two lists of objects (each made up of several strings), is the .Except() method considered to be the most efficient? Is there a way to use projection when using the .Except() method? What I would like to accomplish is something like this:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;

// No projection; this is the version that runs out of memory
var newData = incomingData.Except(storedData).ToList();

// Pseudocode for what I would like to figure out is possible:
// first use projection on db so as to not pull back a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.Field1, x.Field2, x.Field3 }).ToList();
var newDataProjected = incomingData.Select(x => new { x.Field1, x.Field2, x.Field3 })
                                   .Except(storedKeys)
                                   .ToList();
- Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it overhead from EF that makes such a difference?
- Would raw SQL in EF be a better practice in a situation like this (something along the lines of the sketch below)?
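To be concrete about what I mean by raw SQL in EF, this is roughly the shape I have in mind, using the EF6-style Database.SqlQuery<T>. It is only a sketch: MyDbContext, the dbo.Data table name, and the column names are stand-ins for my real schema.

using System.Collections.Generic;
using System.Linq;

// Plain class that holds only the columns needed to identify a record.
public class DataKey
{
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string Field4 { get; set; }
}

public static class StoredKeyLoader
{
    // Pull back only the identifying columns with hand-written SQL,
    // rather than letting EF materialize every oversized nvarchar column.
    public static List<DataKey> LoadStoredKeys(MyDbContext db)
    {
        return db.Database
                 .SqlQuery<DataKey>("SELECT Field1, Field2, Field3, Field4 FROM dbo.Data")
                 .ToList();
    }
}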
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, grouping the kept columns as the key (as can be seen here). This solves the problem of there already being duplicates in the returned data, and gives me something fast to work with when comparing my newly parsed data. The performance increase was very noticeable!
I'm still not sure whether this approaches best practice, but I can certainly live with the performance. I have also seen some references to ToLookup() that I may try next, to see if there is a performance gain there as well (a rough sketch of that idea is at the bottom of this post). Nevertheless, here is some code to show what I did:
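// GroupBy collapses any duplicate keys already present in storedData,
// so the ToDictionary call cannot fail on duplicate keys.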
var storedDataDictionary = storedData.GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4)).ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
{
// duplicateData is a previously defined list
duplicateData.Add(item);
}
else
{
// newData is a previously defined list
newData.Add(item);
}
}
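For completeness, here is the rough shape of the ToLookup() variant I may try next, using the same storedData, parsedData, duplicateData, and newData lists as above. This is only a sketch of the idea: the "|" delimiter is a guard so that two different combinations of fields cannot concatenate to the same key, and unlike ToDictionary(), ToLookup() tolerates duplicate keys without needing the GroupBy() step.

// Build a lookup keyed on the delimited combination of identifying fields.
var storedLookup = storedData.ToLookup(k => string.Join("|", k.Field1, k.Field2, k.Field3, k.Field4));

foreach (var item in parsedData)
{
    var key = string.Join("|", item.Field1, item.Field2, item.Field3, item.Field4);

    if (storedLookup.Contains(key))
    {
        // duplicateData is the same previously defined list as above
        duplicateData.Add(item);
    }
    else
    {
        // newData is the same previously defined list as above
        newData.Add(item);
    }
}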