
Table of People (name, dob, ssn, etc.)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People (it's an update query that sets a flag in the NewRecords table).

Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name and ssn is greater than 2 for all records in People (i.e. the new record has a first name, last name and ssn that are all different from every row in People, and so is likely not a match).

I added a user-defined Levenshtein distance function in T-SQL and have already added an optimization that takes an extra parameter for the maximal allowed distance. (If the calculated distance climbs above the max allowed, the function exits early.) But the query is still taking an unacceptably long time because the tables are large.
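For reference, the function is essentially the standard two-row dynamic-programming implementation with the threshold check bolted on; a simplified sketch (not my exact code) looks like this:

CREATE FUNCTION dbo.Levenshtein_withThreshold
    (@s NVARCHAR(100), @t NVARCHAR(100), @max INT)
RETURNS INT
AS
BEGIN
    IF @s IS NULL OR @t IS NULL RETURN @max + 1;
    IF @s = @t RETURN 0;

    DECLARE @sLen INT = LEN(@s), @tLen INT = LEN(@t);

    -- early exit: if the lengths differ by more than @max,
    -- the distance must exceed @max
    IF ABS(@sLen - @tLen) > @max RETURN @max + 1;

    -- two DP rows packed into strings; cell value v is stored as NCHAR(v + 1)
    DECLARE @prev NVARCHAR(101) = N'', @curr NVARCHAR(101);
    DECLARE @i INT = 0, @j INT, @cost INT, @val INT, @rowMin INT;

    WHILE @i <= @tLen               -- row 0: prev[j] = j
    BEGIN
        SET @prev += NCHAR(@i + 1);
        SET @i += 1;
    END;

    SET @i = 1;
    WHILE @i <= @sLen
    BEGIN
        SET @curr = NCHAR(@i + 1);  -- curr[0] = i
        SET @rowMin = @i;
        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            -- take the minimum of deletion, insertion, substitution
            SET @val = UNICODE(SUBSTRING(@prev, @j + 1, 1));             -- prev[j] + 1
            IF UNICODE(SUBSTRING(@curr, @j, 1)) < @val
                SET @val = UNICODE(SUBSTRING(@curr, @j, 1));             -- curr[j-1] + 1
            IF UNICODE(SUBSTRING(@prev, @j, 1)) - 1 + @cost < @val
                SET @val = UNICODE(SUBSTRING(@prev, @j, 1)) - 1 + @cost; -- prev[j-1] + cost
            SET @curr += NCHAR(@val + 1);
            IF @val < @rowMin SET @rowMin = @val;
            SET @j += 1;
        END;
        -- the early exit: once an entire row is above @max,
        -- the final distance cannot come back under it
        IF @rowMin > @max RETURN @max + 1;
        SET @prev = @curr;
        SET @i += 1;
    END;

    RETURN UNICODE(SUBSTRING(@prev, @tLen + 1, 1)) - 1;
END;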

What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of SQL Server?

update elr
set notmatchflag = 1
from
NewRecords elr
inner join People pro
on
dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2 and
dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2 and
elr.ssn is null and
elr.dob <> pro.dob
bernie2436
  • Have you tried a C# CLR stored procedure? Those are much better at string manipulation than SQL user-defined functions. – Andomar Apr 29 '13 at 13:53
  • 1
    Scalar UDFs have disappointingly poor performance on SQL Server. Plus most functions and expressions in a search clause will prevent it from being able to use an index for that part of the search. If you show us the code for your UDf, we may be able to transform it into something that will optimize better. – RBarryYoung Apr 29 '13 at 13:57

3 Answers


Since I don't know the exact table structure and data types, I'm not 100% sure this will work, but give it a go anyway!

I would first check the SQL execution plan when you test it; usually there will be a few sections that take most of the time. From there you should be able to gauge where, or whether, any indexes would help.

My gut feeling, though, is that your function is being called A LOT by the looks of things; hopefully the execution plan will confirm whether that is the case. If it is, then a CLR function might be the way to go.
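For example, in SSMS you can include the actual execution plan (Ctrl+M) and turn on runtime statistics before running the update against a small sample:

SET STATISTICS IO ON;   -- logical reads per table
SET STATISTICS TIME ON; -- CPU and elapsed time

-- run the UPDATE here, ideally against a sample of NewRecords first

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;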

XN16

There seems to be nothing wrong with your query (except maybe for the fact that you want to find all possible combinations of differing values, which in most scenarios will give a lot of results :)).

Anyway, the problem is your Levenshtein function - I assume it's written in T-SQL. Even optimized, it's still slow. You really should compile it to CLR (the link you posted already contains an example) - this will be an order of magnitude faster.
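If you go that route, the T-SQL side of the wiring looks roughly like this (the assembly path, class and method names here are placeholders, not something from your post):

EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO
-- the DLL is the compiled C# implementation; SAFE is enough for pure string work
CREATE ASSEMBLY LevenshteinClr
FROM 'C:\sqlclr\LevenshteinClr.dll'
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.Levenshtein_CLR
    (@s NVARCHAR(100), @t NVARCHAR(100), @max INT)
RETURNS INT
AS EXTERNAL NAME LevenshteinClr.[UserDefinedFunctions].Levenshtein;
GO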

Another idea I'd try with what you've got is to somehow decrease the number of Levenshtein comparisons. Maybe find other filtering conditions, or reverse the query: find all MATCHING records, and then select what's left (it may enable you to introduce those additional conditions), as sketched below.
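Something along these lines, for example (a sketch only, reusing the column names from your query; the exact definition of a "match" is up to you). The upside is that NOT EXISTS can stop scanning People at the first match it finds, so on average far fewer distance calls are made:

update elr
set notmatchflag = 1
from NewRecords elr
where elr.ssn is null
  and not exists
      (select 1
       from People pro
       where pro.dob = elr.dob
          or dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) <= 2
          or dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) <= 2)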

But Levenshtein compiled to CLR is your best option.

AdamL

For one, skip the rows that are already flagged, so that if you run it again it will not reprocess them.
That distance function is expensive - try to eliminate the pairs that don't stand a chance first.
If the lengths differ by more than 2, then I don't think you can have a distance <= 2 (see the sketch after the query).

update elr
set notmatchflag = 1
from newRecords elr
inner join People pro
  on elr.notmatchflag = 0
 and elr.ssn is null
 and elr.dob <> pro.dob
 and dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2
 and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
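The length check can be folded straight into the join (a sketch; SQL Server does not guarantee short-circuit evaluation of OR, but the optimizer usually costs the cheap predicate first):

update elr
set notmatchflag = 1
from newRecords elr
inner join People pro
  on elr.notmatchflag = 0
 and elr.ssn is null
 and elr.dob <> pro.dob
 -- if the lengths already differ by more than 2, the distance must be > 2,
 -- so the expensive UDF call can be skipped for that pair
 and (abs(len(elr.last) - len(pro.last)) > 2
      or dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2)
 and (abs(len(elr.first) - len(pro.first)) > 2
      or dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2)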
paparazzo