I need to implement a function to look for similar names. It's for the development of a new system that has migrated the data of a previous system. One of the features would be that on account creation, it would try and look for person records that are there with a similar name and propose them. Examples for similar names for "John Johnson" could be:
- John Johnson
- Jon Jonsen
- John & Jane Johnson-Peters
- Fam. Johnsen
- J. Johnson
To achieve this I've created some SQL functions and functional wise they work:
[dbo].[Levenshtein]: A copy of the top rated answer from this question
[dbo].[SplitString]: Which is to split a name string based on '/', '', '&', '+' and '-'
CREATE FUNCTION [dbo].[SplitString]( @s nvarchar(4000) ) RETURNS @table TABLE ([value] VARCHAR(4000)) WITH SCHEMABINDING AS BEGIN DECLARE @repl VARCHAR(4000) = REPLACE(REPLACE(REPLACE(REPLACE(@s, '/', '-'),'&', '-'),'+', '-'),'\', '-'); INSERT INTO @table SELECT TRIM(value) FROM STRING_SPLIT(@repl, '-', NULL) WHERE RTRIM(value) <> ''; RETURN END
[dbo].[IsSimilar]: Takes 2 strings, calls the split function and checks if any part of the splits are similar, meaning within Levenshtein distance 5, an initial, or 'Fam'
CREATE FUNCTION [dbo].[IsSimilar]( @s nvarchar(4000) , @t nvarchar(4000) ) RETURNS BIT WITH SCHEMABINDING AS BEGIN DECLARE @sT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000)) DECLARE @tT TABLE (idx int IDENTITY(1,1), [value] VARCHAR(4000)) DECLARE @sNrOfWords INT, @tNrOfWords INT, @sCount INT = 1, @tCount INT = 1, @sVal VARCHAR(4000), @tVal VARCHAR(4000) IF (@s = 'fam' OR @s = 'fam.' OR @t = 'fam' OR @t = 'fam.') return 1 INSERT INTO @sT SELECT [value] FROM [dbo].[SplitString](@s) INSERT INTO @tT SELECT [value] FROM [dbo].[SplitString](@t) SET @sNrOfWords = (SELECT COUNT([value]) FROM @sT) SET @tNrOfWords = (SELECT COUNT([value]) FROM @tT) IF (@sNrOfWords > 0 AND @tNrOfWords > 0) BEGIN WHILE (@sCount <= (SELECT MAX(idx) FROM @sT)) BEGIN SET @sVal = (SELECT [value] FROM @sT WHERE idx = @sCount) WHILE (@tCount <= (SELECT MAX(idx) FROM @tT)) BEGIN SET @tVal = (SELECT [value] FROM @tT WHERE idx = @tCount) IF (((LEN(@sVal) = 1 OR LEN(@tVal) = 1 OR SUBSTRING(@sVal, 2, 1) IN ('.', ' ') OR SUBSTRING(@tVal, 2, 1) IN ('.', ' ')) AND SUBSTRING(@sVal, 1, 1) = SUBSTRING(@tVal, 1, 1)) OR ((SELECT [dbo].[Levenshtein](@sVal,@tVal, 5)) IS NOT NULL)) RETURN 1 SET @tCount = @tCount + 1 END SET @sCount = @sCount + 1 SET @tCount = 1 END END RETURN 0 END
Sadly this solution isn't the most performant :(
It takes 12-13s to find this record based on my mispelled name, which off course is too long.
The full table is only 512 records at the moment.
Any help on getting this more performant? I know looping isn't recomended in SQL, so probably something to gain there. I'm not a DBA or SQL specialist, no idea how to write that differently without the loops. Didn't think I could use a join as there's no equality.