I need to import many Excel files into a SQL Server database, and custom business logic prevents using a simple SSIS package.
High-level description of the process (roughly sketched in code after the list):
- Pull the existing SQL data into a DataTable (~1M rows)
- Read the Excel file into a 2D array in one chunk
- Validate the fields row by row (custom logic)
- Check each row for a duplicate in the existing DataTable
- Create a new DataRow and add it to the DataTable
- Bulk insert the DataTable into the SQL table
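In code, the flow looks roughly like this (a minimal sketch only: ReadExcelRange, IsValid and IsDuplicate stand in for my Excel reader, validation and duplicate check, and the Records table name and connection string are made up):

var existingData = new DataTable();
using (var conn = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter("SELECT Entry, Device FROM Records", conn))
{
    adapter.Fill(existingData);                    // ~1M existing rows
}

object[,] cells = ReadExcelRange(filePath);        // whole sheet read in one chunk

var toInsert = existingData.Clone();               // same schema, no rows
for (int i = 0; i < cells.GetLength(0); i++)
{
    if (!IsValid(cells, i)) continue;              // custom per-row validation

    int entry = Convert.ToInt32(cells[i, 0]);
    string dev = Convert.ToString(cells[i, 1]);

    if (IsDuplicate(existingData, entry, dev))     // <-- this is the slow part
        continue;                                  // duplicates go down an update path instead

    DataRow row = toInsert.NewRow();
    row["Entry"] = entry;
    row["Device"] = dev;
    toInsert.Rows.Add(row);
}

using (var bulk = new SqlBulkCopy(connectionString))
{
    bulk.DestinationTableName = "Records";
    bulk.WriteToServer(toInsert);
}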
Problem with my approach: each row must be checked for duplicates. I assumed a round trip to the remote server for every row (letting SQL do the check) would be too slow, so I opted for LINQ against the in-memory DataTable. The query itself is simple, but it does a linear scan of the ~1M-row table for every imported row, so the import crawls: roughly 90% of the execution time is spent in this query comparing the fields.
// Runs once per imported Excel row and enumerates the whole ~1M-row DataTable
var existingRows = from row in recordDataTable.AsEnumerable()
                   where row.Field<int>("Entry") == entry
                      && row.Field<string>("Device") == dev
                   select row;
bool update = existingRows.Count() > 0;
What other ways might there be to check for duplicates more efficiently?
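For example, one direction I've wondered about (a sketch only, not something I've tried yet) is building a set of the existing (Entry, Device) keys once, so each per-row check becomes a hash lookup instead of a table scan:

// One pass over the existing data to build the key set up front
var existingKeys = new HashSet<Tuple<int, string>>(
    recordDataTable.AsEnumerable()
        .Select(r => Tuple.Create(r.Field<int>("Entry"), r.Field<string>("Device"))));

// Then each per-row duplicate check is (on average) a constant-time lookup
bool update = existingKeys.Contains(Tuple.Create(entry, dev));

Would that be a reasonable approach, or is there a better pattern for this?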