1

I wish to import many files into a database (with custom business logic preventing simple SSIS package use).

High level description of problem:

  • Pull existing sql data into DataTable (~1M rows)
  • Read excel file into 2d array in one chunk
  • validate fields row by row (custom logic)
  • Check for duplicate in existing DataTable
  • Insert into DataRow
  • Bulk insert DataTable into SQL table

Problem with my approach: Each row must be checked for duplicates, I thought a call to remote server to leverage SQL would be too slow, so I opted for LINQ. The query was simple, but the size of the dataset causes it to crawl (90% execution time spent in this query checking the fields).

var existingRows = from row in recordDataTable.AsEnumerable()
                                 where row.Field<int>("Entry") == entry
                                 &&    row.Field<string>("Device") == dev
                                 select row;
bool update = existingRows.Count() > 0;

What other ways might there be to more efficiently check for duplicates?

SeeMoreGain
  • 1,263
  • 16
  • 36
  • My first thought is that you should probably be drop `Count()` in favor of [`Any()`](http://stackoverflow.com/questions/704833/linq-count-returned-results). Then I realize you might be looping this over your entire dataset, in which case you're probably better off with [`Distinct()`](http://stackoverflow.com/questions/1606679/remove-duplicates-in-the-list-using-linq) or [`Union()`](http://msdn.microsoft.com/en-us/library/bb341731(v=vs.110).aspx). – rutter Feb 19 '14 at 00:57

1 Answers1

1

Using linq it will basically do a for loop over your ~1M records every time you check for a duplicate.

You would be better off putting the data into a dictionary so your lookups are against an in memory index.

Daniel
  • 2,744
  • 1
  • 31
  • 41
  • Thanks I was thinking something along those lines, but I was not sure if LINQ had any kind of optimisation for that kind of thing or if it is just a glorified foreach loop – SeeMoreGain Feb 19 '14 at 00:51
  • Ideally I would like the solution to return the actual row from the query, but I guess I could store key value pairs for the row number of the data and then retrieve it manually. Will have a play around and see what gains can be made. – SeeMoreGain Feb 19 '14 at 00:55