
I have written a piece of lab processing software that takes incoming data, breaks it up, and sends it to the proper fields in a database. The DB table is really big and has a lot of fields. Because we get repeat data from multiple sources, I do a dupe check before inserting anything.

The system works by parsing the incoming data and filling a LINQ-to-SQL object. The dupe check works by performing a .Where operation on a list of LINQ-to-SQL objects to select the objects not yet in the table.

For instance ...

Dim input As List(Of TableA)   ' LINQ-to-SQL objects parsed from the incoming data
Dim output = input.Where(Function(x As TableA) _
        Not myDb.TableA.Any(Function(l As TableA) l.name = x.name AndAlso l.dob = x.dob)).ToList()   ' ...etc. for all 10 fields

The syntax is a little arcane, but this was the only way I could find to join the in-memory LINQ-to-SQL objects against the database records. Before I used this method, I went row by row performing the dupe check, which was far slower.

As I understand it, LINQ converts this statement to a SQL statement that runs on the server.

My question is: is there any way to get this to run faster? Is there any reason to expect that writing out a SQL statement to do the deduping and running a traditional query would be faster? This statement is slow, but it works, is unit tested, and blocks the dupes. I am looking for something that is faster and equivalently clean (not hard, I know) or cleaner...
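
For concreteness, the handwritten alternative I have in mind would be something like this (just a sketch, not benchmarked; the two fields stand in for all 10, and DataContext.ExecuteCommand is the LINQ-to-SQL escape hatch for raw SQL):

For Each x In input
    ' Sketch only: one guarded INSERT per row. {0}/{1} are LINQ-to-SQL
    ' command parameters, so nothing is concatenated into the SQL text.
    myDb.ExecuteCommand( _
        "INSERT INTO TableA (name, dob) " & _
        "SELECT {0}, {1} " & _
        "WHERE NOT EXISTS (SELECT 1 FROM TableA WHERE name = {0} AND dob = {1})", _
        x.name, x.dob)   ' ...etc. for all 10 fields
Next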

bernie2436
  • Although not a direct answer to your problem, one technique I have used for a similar (although not identical) use case is to write a stored procedure that makes the appropriate create/update/ignore decision. This moves all of the logic server side, so that only one SQL communication (the stored procedure invocation) is executed by LINQ. This will help if you are network bound. – SAJ14SAJ Nov 29 '12 at 23:49
  • @SAJ14SAJ I don't totally understand. Doesn't the LINQ query get translated to a single SQL statement passed to the server? We are running on a slow network, so if that is not the case, then that might be the holdup (aside from just a big DB). – bernie2436 Nov 29 '12 at 23:52
  • The short answer is: it depends. You can find out exactly what it is doing by assigning a `TextWriter` to the `Log` property of your DataContext (see the snippet just below these comments). But depending on circumstance, it may be doing its own "does this row require an update" check before it commits each insert. Your Where clause will certainly do a query, and LINQ then always issues another SQL command for each and every pending insert, one by one. There is a lot of information on the web for "LINQ sql optimization" queries. – SAJ14SAJ Nov 29 '12 at 23:56
  • Assuming this is SQL Server, I suggest you capture the SQL statements in SQL Profiler and see what it is generating. I suspect that it is doing work client side, not in the database, so I support SAJ14SAJ's suggestion. – Nick.Mc Nov 30 '12 at 01:11
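
For reference, the logging hook SAJ14SAJ mentions is a one-line assignment on the DataContext (a minimal sketch, using the `myDb` context from the question; `Console.Out` could be any `TextWriter`):

myDb.Log = Console.Out   ' every SQL command LINQ to SQL generates is echoed here
' ...now run the .Where query from the question and inspect what actually hits the server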

2 Answers


You can define a unique index on your table based on name, dob, and the rest of the fields. Each insert will then either succeed or raise a unique constraint violation exception, so you won't need any check before inserting. I think it's the most straightforward way of doing it.
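
A minimal sketch of what the calling side could look like, assuming SQL Server, where a unique violation surfaces as a SqlException with Number 2627 (unique constraint) or 2601 (unique index):

' One-time setup (T-SQL): CREATE UNIQUE INDEX UX_TableA_Dedupe ON TableA (name, dob, ...)
For Each x In input
    Try
        myDb.TableA.InsertOnSubmit(x)
        myDb.SubmitChanges()   ' per-row submit so one dupe doesn't fail the whole batch
    Catch ex As System.Data.SqlClient.SqlException When ex.Number = 2627 OrElse ex.Number = 2601
        ' Duplicate row: the index rejected it. Cancel the pending insert
        ' so later SubmitChanges calls don't retry it.
        myDb.TableA.DeleteOnSubmit(x)
    End Try
Next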

Farshid Zaker
  • Certainly appropriate indexing will help. – SAJ14SAJ Nov 29 '12 at 23:57
  • What do other people think of this? I think this is the cleanest because it (1) really prevents dupes and (2) eliminates a dupe check operation, meaning less code. On the other hand, it would be using exceptions for flow control, which I've heard is a no-no. – bernie2436 Nov 30 '12 at 14:26
  • A good discussion about this subject can be found here: http://stackoverflow.com/questions/1736146/why-is-exception-handling-bad – Farshid Zaker Nov 30 '12 at 19:37

If no other processes are adding to the SQL table, then you could read the table into a HashSet when you start the program. Check the local HashSet first; if the record is not in the HashSet, then add it to both the SQL table and the HashSet. A HashSet lookup is about 100 times faster than a SQL query, even running on the same physical box. I use this for some large loads.

If you are only going to get dups over a short time period, then you could skip loading the historic rows from the table at startup, or only load the last X. Check that HashSet, and only if the record is not found in the HashSet, use a stored procedure to insert or skip. And periodically truncate the HashSet to X.
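
A rough sketch of the idea (the composite key format and field names are placeholders; dob is assumed to be a DateTime, and the key should use the same fields as the dupe check):

' Build one composite string key per row; use a separator that cannot occur in the data.
Dim keyOf = Function(r As TableA) r.name & "|" & r.dob.ToString("yyyyMMdd")   ' ...etc. for all 10 fields

' Seed the set from the table once at startup (or only the last X rows, as above).
Dim seen As New HashSet(Of String)(myDb.TableA.ToList().Select(keyOf))

For Each x In input
    If seen.Add(keyOf(x)) Then          ' Add returns False if the key was already present
        myDb.TableA.InsertOnSubmit(x)   ' new row: queue the insert
    End If
Next
myDb.SubmitChanges()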

paparazzo
  • I like this idea, but the table might be too big. I will play around with it when I'm back at work. – bernie2436 Nov 30 '12 at 22:50
  • If the table is large then just do the last X. If the HashSet finds 90% of the dups then you have eliminated 90% of the dup checks to SQL. – paparazzo Nov 30 '12 at 22:55
  • By 'only load the last x' do you mean only load the x most recently added entries in the table? Won't I still have to check the rest of the table for dupes? – bernie2436 Nov 30 '12 at 23:34
  • I don't know how to be any more clear than my last comment. If you find a dup in the HashSet then you don't have to check SQL. Since the repeats arrive in near real time, I assumed the likelihood of old dups is low, so if you check the last X you can still catch most of them. – paparazzo Nov 30 '12 at 23:48