
I need to implement a function that looks for similar names. It's for a new system that has migrated the data from a previous system. One of the features is that, on account creation, it should look for existing person records with a similar name and propose them. Examples of similar names for "John Johnson" could be:

  • John Johnson
  • Jon Jonsen
  • John & Jane Johnson-Peters
  • Fam. Johnsen
  • J. Johnson

To achieve this I've created some SQL functions, and functionally they work:

  • [dbo].[Levenshtein]: A copy of the top-rated answer to this question

  • [dbo].[SplitString]: Splits a name string on '/', '\', '&', '+' and '-'

     CREATE FUNCTION [dbo].[SplitString](
         @s nvarchar(4000)
     )
     RETURNS @table TABLE ([value] VARCHAR(4000))
     WITH SCHEMABINDING
     AS
     BEGIN
         DECLARE @repl VARCHAR(4000) = REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-');
         INSERT INTO @table 
         SELECT TRIM(value) FROM STRING_SPLIT(@repl, '-', NULL)  
         WHERE RTRIM(value) <> '';
    
         RETURN
     END
    
  • [dbo].[IsSimilar]: Takes two strings, calls the split function on both, and checks whether any pair of parts is similar, meaning within Levenshtein distance 5, a matching initial, or 'Fam'

     CREATE FUNCTION [dbo].[IsSimilar](
         @s nvarchar(4000)
       , @t nvarchar(4000)
     )
     RETURNS BIT
     WITH SCHEMABINDING
     AS
     BEGIN
         DECLARE @sT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
         DECLARE @tT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000))
         DECLARE @sNrOfWords INT,
                 @tNrOfWords INT,
                 @sCount INT = 1,
                 @tCount INT = 1,
                 @sVal VARCHAR(4000),
                 @tVal VARCHAR(4000)
    
         IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
             return 1
    
         INSERT INTO @sT SELECT [value] FROM [dbo].[SplitString](@s)
         INSERT INTO @tT SELECT [value] FROM [dbo].[SplitString](@t)
    
         SET @sNrOfWords = (SELECT COUNT([value]) FROM @sT)
         SET @tNrOfWords = (SELECT COUNT([value]) FROM @tT)
    
         IF (@sNrOfWords > 0 AND @tNrOfWords > 0)
         BEGIN
             WHILE (@sCount <= (SELECT MAX(idx) FROM @sT))
             BEGIN           
                 SET @sVal = (SELECT [value] FROM @sT WHERE idx = @sCount)
                 WHILE (@tCount <= (SELECT MAX(idx) FROM @tT))
                 BEGIN
                     SET @tVal = (SELECT [value] FROM @tT WHERE idx = @tCount)
                     IF (((LEN(@sVal) = 1 OR LEN(@tVal) = 1 OR SUBSTRING(@sVal, 2, 1) IN ('.', ' ') OR SUBSTRING(@tVal, 2, 1) IN ('.', ' ')) AND SUBSTRING(@sVal, 1, 1) = SUBSTRING(@tVal, 1, 1)) OR ((SELECT [dbo].[Levenshtein](@sVal,@tVal, 5)) IS NOT NULL))
                         RETURN 1
                     SET @tCount = @tCount + 1
                 END
                 SET @sCount = @sCount + 1
                 SET @tCount = 1
             END
         END
         RETURN 0
     END
    

Sadly this solution isn't the most performant :( It takes 12-13 s to find this record based on my misspelled name, which of course is too long. The full table is only 512 records at the moment.

Any help on making this more performant? I know looping isn't recommended in SQL, so there's probably something to gain there, but I'm not a DBA or SQL specialist and have no idea how to write this without the loops. I didn't think I could use a JOIN, as there's no equality condition.
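For reference, the matching rule described above (split on the separators, treat one-letter or dotted parts as initials, always accept 'fam', otherwise accept parts within Levenshtein distance 5) can be sketched outside T-SQL. A hedged Python sketch of that logic, with helper names of my own choosing; it approximates rather than reproduces the exact SQL semantics:

```python
import re

def levenshtein(s, t):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def split_name(name):
    """Split on the same separators the SQL SplitString uses."""
    return [p.strip() for p in re.split(r"[/\\&+\-]", name) if p.strip()]

def is_initial(part):
    """One letter, or a letter followed by '.' or ' ' ('J.', 'J Johnson')."""
    return len(part) == 1 or part[1] in ". "

def is_similar(a, b, max_dist=5):
    """Approximation of [dbo].[IsSimilar]: 'fam' always matches, initials
    match on their first letter, everything else by edit distance."""
    if a.lower().rstrip(".") == "fam" or b.lower().rstrip(".") == "fam":
        return True
    for pa in split_name(a):
        for pb in split_name(b):
            if (is_initial(pa) or is_initial(pb)) and pa[0].lower() == pb[0].lower():
                return True
            if levenshtein(pa.lower(), pb.lower()) <= max_dist:
                return True
    return False
```

Note that with a cap of 5, even quite different short surnames count as similar (any two five-letter names are within five substitutions); that looseness is inherent in the original rule, not this sketch.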

Panagiotis Kanavos
FoxHound
  • Converting both of your functions to be inline table-value functions would be a great start. Multi-line table value functions are known to perform poorly, multi-line scalar functions *can* perform poorly, and `WHILE` loops *do* perform terribly. – Thom A Feb 08 '22 at 09:52
  • What you ask for is fuzzy matching, and unless you can use an appropriate index, you *can't* make it fast. SQL queries run fast when they can use indexes. I don't think any database product can accelerate the calculation of Levenshtein distances, unless they use special index types. – Panagiotis Kanavos Feb 08 '22 at 09:54
  • Calculating similarity is typically the job of search engines. Even Full Text Search doesn't calculate similarities. Elastic *does* calculate similarities. Another possibility is to use a Python package with `sp_execute_external_script` to calculate the Levenshtein distances for all rows in a query result set. – Panagiotis Kanavos Feb 08 '22 at 09:56
  • @PanagiotisKanavos is right though, converting to more performant objects/code isn't going to make the actual lookups fast. It may well get faster (as I suspect the "multi-lineness" and the `WHILE` are trashing performance more), but it'll likely never be "fast"; not with a good amount of data anyway. – Thom A Feb 08 '22 at 09:56
  • There is a way to use B-Trees to accelerate Levenshtein distances but it uses a *ton* of space - precalculate all combinations of the existing data up to a distance of eg 2-3, store them in a table and index them. Obviously, that takes a *lot* of space. When you want to search, calculate the alternatives for the max distance you want and search the precalculated combinations for exact matches. This operation is *very* fast because it uses the index, but obviously the index itself will be several times larger than the original data – Panagiotis Kanavos Feb 08 '22 at 10:00
  • 2
    An SQLCLR function would probably be significantly faster. Also you should precalculate those splits and store them in a table – Charlieface Feb 08 '22 at 10:06
  • Can I put an index on those columns? From the examples alone I would have 3 "Johnson" entries in the LastName column. Also @Larnu, any example on how to rewrite that table value function and while loops – FoxHound Feb 08 '22 at 10:09
  • Knowing how to convert to set based methodology isn't something you can give examples for, it's something you need to go out and learn. The documentation contains an [example](https://learn.microsoft.com/en-us/sql/t-sql/statements/create-function-transact-sql?view=sql-server-ver15#b-creating-an-inline-table-valued-function) of an inline table value function. Note the lack of `BEGIN...END` and that it consists of a single statement. – Thom A Feb 08 '22 at 10:26
  • @FoxHound an index won't help at all. Indexes can be used for exact searches or prefix searches, while you try to find the distance between *every* word in your table and the input word. For short distances, eg 1-2, perhaps you could calculate the combinations of your input word and search for all of them with eg an `IN` clause. That would result in a lot of combinations though - for a word like `potato` and a distance of 1 you'd have to calculate the 5 alternatives with a single omission, 24^6 substitutions, 25^7 for single addition – Panagiotis Kanavos Feb 08 '22 at 10:41
  • @Charlieface in the general case (not 512 words) a SQLCLR would *not* be significantly faster without proper indexing, because it would still have to scan the entire table and calculate the distance for all words every time. It *could* be faster if a trie was used to index the words. Unfortunately, the way SQLCLR works, the trie would have to be loaded every time the method was called. Updating it could cause concurrency problems. In the end it wouldn't be a lot better than using an external index or package. Python already has such packages, and all supported SQL Server versions can use it – Panagiotis Kanavos Feb 08 '22 at 10:46
  • I'm going to see what gain I get from rewriting the Split function to a single line. Is it possible to write the WHILE loops like "EXIST (SELECT * FROM [@sT] INNER JOIN [@tT] ON (((LEN([@sVal]) = ... IS NOT NULL))" (the large IF clause deep in the loops)? And would that increase performance? – FoxHound Feb 08 '22 at 10:56
  • 1
    Not really. You're working with two strings, not indexed values in a table. `idx` isn't indexed, but unless there are a lot of words in the string, the cost of indexing that field will probably be greater than any benefit. It would be better if you used a SQLCLR function to calculate the Levenshtein distance between two strings. It would be a *lot* better if you didn't try to do this in T-SQL at all. – Panagiotis Kanavos Feb 08 '22 at 11:02
  • If it'd be more performant to pump over the data and do the lookup in the application, I'll rewrite the functions in code. In the end it'll only have to run on records with AccountID = NULL, which won't be the case with newly added Person records anyway. – FoxHound Feb 08 '22 at 11:20
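The precalculation idea from the comments (store all deletion variants of every indexed name up to some depth, so a fuzzy lookup becomes a handful of exact, indexable matches on the query's own variants) is essentially what the SymSpell family of algorithms does. A minimal Python sketch under that assumption; the names and parameters here are illustrative, and candidates still need verification with a real distance function:

```python
from collections import defaultdict
from itertools import combinations

def deletion_variants(word, max_deletes):
    """All strings reachable from word by deleting up to max_deletes chars."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for idxs in combinations(range(len(word)), k):
            dropped = set(idxs)
            variants.add("".join(c for i, c in enumerate(word) if i not in dropped))
    return variants

def build_index(names, max_deletes=2):
    """Map each deletion variant back to the original names (computed once)."""
    index = defaultdict(set)
    for name in names:
        for v in deletion_variants(name.lower(), max_deletes):
            index[v].add(name)
    return index

def lookup(index, query, max_deletes=2):
    """Exact lookups on the query's own deletion variants yield candidates;
    verify those candidates afterwards with a true distance function."""
    candidates = set()
    for v in deletion_variants(query.lower(), max_deletes):
        candidates |= index.get(v, set())
    return candidates
```

As the comments point out, this trades space for speed: every name is stored many times over, but each lookup touches only exact keys that an index can serve.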

1 Answer


After implementing the suggestions from the comments on the question, I managed to cut the same SELECT statement down from 12-13 s to about 1 s, which is a lot more acceptable.

SplitString has been changed to an inline table-valued function:

CREATE FUNCTION [dbo].[SplitString](
    @s nvarchar(4000)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN(
    SELECT TRIM(value) AS value FROM STRING_SPLIT(REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-'), '-', NULL)  
    WHERE RTRIM(value) <> ''
);

Cutting down on the variables and using a JOIN in the IsSimilar function gives a boost as well:

CREATE FUNCTION [dbo].[IsSimilar](
    @s nvarchar(4000)
    , @t nvarchar(4000)
)
RETURNS BIT
AS
BEGIN

    IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.')
        return 1

    IF (LEN(TRIM(@s)) > 0 AND LEN(TRIM(@t)) > 0)
    BEGIN
        RETURN (SELECT IIF(EXISTS (
            SELECT [sT].[value]
            FROM [dbo].[SplitString](@s) AS sT
            INNER JOIN [dbo].[SplitString](@t) AS tT
                ON (
                       (LEN([sT].[value]) = 1
                        OR LEN([tT].[value]) = 1
                        OR SUBSTRING([sT].[value], 2, 1) IN ('.', ' ')
                        OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' '))
                       AND SUBSTRING([sT].[value], 1, 1) = SUBSTRING([tT].[value], 1, 1)
                   )
                OR (
                       NOT (SUBSTRING([sT].[value], 2, 1) IN ('.', ' ')
                            OR SUBSTRING([tT].[value], 2, 1) IN ('.', ' '))
                       AND [dbo].[Levenshtein]([sT].[value], [tT].[value], 5) IS NOT NULL
                   )
            ), 1, 0))
    END
    RETURN 0
END

I don't know how well this boost will hold up against really big data, but that won't be an issue here: Person records get linked to Account records with every new account creation, and only Person records with AccountID = NULL are considered.
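If the lookup ever moves into the application, as considered in the comments, the contract of [dbo].[Levenshtein] used here (returning NULL once the distance exceeds the cap) can be mirrored with an early-exit distance function. A Python sketch assuming that semantics:

```python
def levenshtein_max(s, t, max_dist):
    """Edit distance between s and t, capped at max_dist.
    Returns None when the distance exceeds the cap, mirroring the NULL
    that [dbo].[Levenshtein] returns in that case (an assumption based
    on how the function is used in the query above)."""
    # Lengths alone can rule the pair out.
    if abs(len(s) - len(t)) > max_dist:
        return None
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        if min(cur) > max_dist:
            return None  # every alignment already exceeds the cap
        prev = cur
    return prev[-1] if prev[-1] <= max_dist else None
```

The row-minimum early exit keeps the worst case bounded, which matters once this runs per candidate pair on every account creation.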

FoxHound