I have inherited a poorly designed database table (no primary key or indexes, oversized nvarchar fields, dates stored as nvarchar, etc.). This table has roughly 350,000 records. At predefined intervals I get handed a list of around 2,000 potentially new records, and I have to insert any of them that the database does not already have a matching record for.
I initially tried making the comparisons in a foreach loop, but it quickly became obvious that there was probably a much more efficient way. After doing some research, I then tried the .Any(), .Contains(), and .Except() methods.
My research led me to believe that the .Except() method would be the most efficient, but I get out-of-memory errors when trying it. The .Any() and .Contains() methods both seem to take roughly the same time to complete (which is faster than the foreach loop).
The structure of the two lists is identical, and each contains multiple strings. I have a few questions that I have not found satisfying answers to, if you don't mind.
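For reference, the objects in both lists look roughly like this; the field names below are placeholders rather than the real column names:

public class Data
{
    // Every field is a string because the underlying table stores
    // everything as nvarchar, including the dates.
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string Field4 { get; set; }
}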
- When comparing two lists of objects (each made up of several strings), is the .Except() method considered to be the most efficient? Is there a way to use projection when using the .Except() method? What I would like to accomplish is something like this:
List<Data> storedData = db.Data.ToList();
List<Data> incomingData = someDataPreviouslyParsed;

// No projection; this is the version that runs out of memory
var newData = incomingData.Except(storedData).ToList();

// Pseudocode for what I would like to figure out is possible:
// first use projection on db so as to not pull back a bunch of irrelevant data
var storedKeys = db.Data.Select(x => new { x.Field1, x.Field2, x.Field3 }).ToList();
var newDataProjected = incomingData.Select(x => new { x.Field1, x.Field2, x.Field3 })
                                   .Except(storedKeys)
                                   .ToList();
- Using a raw SQL statement in SQL Server Management Studio, the query takes slightly longer than 10 seconds. Using EF, it seems to take in excess of a minute. Is that poorly optimized SQL generated by EF, or is it overhead from EF that makes such a difference?
- Would raw SQL in EF be a better practice in a situation like this (something along the lines of the sketch below)?
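To be concrete about what I mean by raw SQL in EF, this is roughly the shape I have in mind, using the EF6-style Database.SqlQuery<T>. It is only a sketch: MyDbContext, the dbo.Data table name, and the column names are stand-ins for my real schema.

using System.Collections.Generic;
using System.Linq;

// Plain class that holds only the columns needed to identify a record.
public class DataKey
{
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string Field4 { get; set; }
}

public static class StoredKeyLoader
{
    // Pull back only the identifying columns with hand-written SQL,
    // rather than letting EF materialize every oversized nvarchar column.
    public static List<DataKey> LoadStoredKeys(MyDbContext db)
    {
        return db.Database
                 .SqlQuery<DataKey>("SELECT Field1, Field2, Field3, Field4 FROM dbo.Data")
                 .ToList();
    }
}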
Semi-Off-Topic:
When grabbing the data from the database and storing it in the variable storedData, does that eliminate the usefulness of any indexes (should there be any) stored in the table?
I hate to ask so many questions, and I'm sure that many (if not all) of them are quite noobish. However, I have nowhere else to turn, and I have been looking for clear answers all day. Any help is very much appreciated.
UPDATE
After further research, I have found what seems to be a very good solution to this problem. Using EF, I grab the 350,000 records from the database, keeping only the columns I need to create a unique record. I then take that data and convert it to a dictionary, grouping the kept columns as the key (as can be seen here). This solves the problem of there already being duplicates in the returned data, and gives me something fast to work with when comparing my newly parsed data. The performance increase was very noticeable!
I'm still not sure whether this approaches best practice, but I can certainly live with the performance. I have also seen some references to ToLookup() that I may try next, to see if there is a performance gain there as well (a rough sketch of that idea is at the bottom of this post). Nevertheless, here is some code to show what I did:
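// GroupBy collapses any duplicate keys already present in storedData,
// so the ToDictionary call cannot fail on duplicate keys.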
var storedDataDictionary = storedData.GroupBy(k => (k.Field1 + k.Field2 + k.Field3 + k.Field4)).ToDictionary(g => g.Key, g => g.First());
foreach (var item in parsedData)
{
if (storedDataDictionary.ContainsKey(item.Field1 + item.Field2 + item.Field3 + item.Field4))
{
// duplicateData is a previously defined list
duplicateData.Add(item);
}
else
{
// newData is a previously defined list
newData.Add(item);
}
}
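For completeness, here is the rough shape of the ToLookup() variant I may try next, using the same storedData, parsedData, duplicateData, and newData lists as above. This is only a sketch of the idea: the "|" delimiter is a guard so that two different combinations of fields cannot concatenate to the same key, and unlike ToDictionary(), ToLookup() tolerates duplicate keys without needing the GroupBy() step.

// Build a lookup keyed on the delimited combination of identifying fields.
var storedLookup = storedData.ToLookup(k => string.Join("|", k.Field1, k.Field2, k.Field3, k.Field4));

foreach (var item in parsedData)
{
    var key = string.Join("|", item.Field1, item.Field2, item.Field3, item.Field4);

    if (storedLookup.Contains(key))
    {
        // duplicateData is the same previously defined list as above
        duplicateData.Add(item);
    }
    else
    {
        // newData is the same previously defined list as above
        newData.Add(item);
    }
}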