Bulk insert c# datatable to sql server - Efficient check for duplicate

Question

I wish to import many files into a database (with custom business logic preventing simple SSIS package use).

High level description of problem:

Pull existing sql data into DataTable (~1M rows)
Read excel file into 2d array in one chunk
validate fields row by row (custom logic)
Check for duplicate in existing DataTable
Insert into DataRow
Bulk insert DataTable into SQL table

Problem with my approach: Each row must be checked for duplicates, I thought a call to remote server to leverage SQL would be too slow, so I opted for LINQ. The query was simple, but the size of the dataset causes it to crawl (90% execution time spent in this query checking the fields).

var existingRows = from row in recordDataTable.AsEnumerable()
                                 where row.Field<int>("Entry") == entry
                                 &&    row.Field<string>("Device") == dev
                                 select row;
bool update = existingRows.Count() > 0;

What other ways might there be to more efficiently check for duplicates?

My first thought is that you should probably be drop `Count()` in favor of [`Any()`](http://stackoverflow.com/questions/704833/linq-count-returned-results). Then I realize you might be looping this over your entire dataset, in which case you're probably better off with [`Distinct()`](http://stackoverflow.com/questions/1606679/remove-duplicates-in-the-list-using-linq) or [`Union()`](http://msdn.microsoft.com/en-us/library/bb341731(v=vs.110).aspx). — rutter, Feb 19 '14 at 00:57

score 1 · Accepted Answer · answered Feb 19 '14 at 00:47

1

Using linq it will basically do a for loop over your ~1M records every time you check for a duplicate.

You would be better off putting the data into a dictionary so your lookups are against an in memory index.

answered Feb 19 '14 at 00:47

Daniel

2,744
1
31
41

Thanks I was thinking something along those lines, but I was not sure if LINQ had any kind of optimisation for that kind of thing or if it is just a glorified foreach loop – SeeMoreGain Feb 19 '14 at 00:51
Ideally I would like the solution to return the actual row from the query, but I guess I could store key value pairs for the row number of the data and then retrieve it manually. Will have a play around and see what gains can be made. – SeeMoreGain Feb 19 '14 at 00:55

Bulk insert c# datatable to sql server - Efficient check for duplicate

1 Answers1