2

I am using SQL Server 2008 R2 SP1.

I have a table with about 36034 records of customers. I am trying to implement Fuzy search on Customer Name field.

Here is Function for Fuzzy Search

ALTER FUNCTION [Party].[FuzySearch]
    (
      @Reference VARCHAR(200) ,
      @Target VARCHAR(200)
    )
RETURNS DECIMAL(5, 2)
    WITH SCHEMABINDING
AS
    BEGIN 
        DECLARE @score DECIMAL(5, 2) 
        SELECT  @score = CASE WHEN @Reference = @Target
                              THEN CAST(100 AS NUMERIC(5, 2))
                              WHEN @Reference IS NULL
                                   OR @Target IS NULL
                              THEN CAST(0 AS NUMERIC(5, 2))
                              ELSE ( SELECT [Score %] = CAST(SUM(LetterScore)
                                            * 100.0 / MAX(WordLength
                                                          * WordLength) AS NUMERIC(5,
                                                              2))
                                     FROM   ( -- do
                                              SELECT    seq = t1.n ,
                                                        ref.Letter ,
                                                        v.WordLength ,
                                                        LetterScore = v.WordLength
                                                        - ISNULL(MIN(tgt.n),
                                                              v.WordLength)
                                              FROM      ( -- v
                                                          SELECT
                                                              Reference = LEFT(@Reference
                                                              + REPLICATE('_',
                                                              WordLength),
                                                              WordLength) ,
                                                              Target = LEFT(@Target
                                                              + REPLICATE('_',
                                                              WordLength),
                                                              WordLength) ,
                                                              WordLength = WordLength
                                                          FROM
                                                              ( -- di
                                                              SELECT
                                                              WordLength = MAX(WordLength)
                                                              FROM
                                                              ( VALUES
                                                              ( DATALENGTH(@Reference)),
                                                              ( DATALENGTH(@Target)) ) d ( WordLength )
                                                              ) di
                                                        ) v
                                                        CROSS APPLY ( -- t1
                                                              SELECT TOP ( WordLength )
                                                              n
                                                              FROM
                                                              ( VALUES ( 1),
                                                              ( 2), ( 3), ( 4),
                                                              ( 5), ( 6), ( 7),
                                                              ( 8), ( 9),
                                                              ( 10), ( 11),
                                                              ( 12), ( 13),
                                                              ( 14), ( 15),
                                                              ( 16), ( 17),
                                                              ( 18), ( 19),
                                                              ( 20), ( 21),
                                                              ( 22), ( 23),
                                                              ( 24), ( 25),
                                                              ( 26), ( 27),
                                                              ( 28), ( 29),
                                                              ( 30), ( 31),
                                                              ( 32), ( 33),
                                                              ( 34), ( 35),
                                                              ( 36), ( 37),
                                                              ( 38), ( 39),
                                                              ( 40), ( 41),
                                                              ( 42), ( 43),
                                                              ( 44), ( 45),
                                                              ( 46), ( 47),
                                                              ( 48), ( 49),
                                                              ( 50), ( 51),
                                                              ( 52), ( 53),
                                                              ( 54), ( 55),
                                                              ( 56), ( 57),
                                                              ( 58), ( 59),
                                                              ( 60), ( 61),
                                                              ( 62), ( 63),
                                                              ( 64), ( 65),
                                                              ( 66), ( 67),
                                                              ( 68), ( 69),
                                                              ( 70), ( 71),
                                                              ( 72), ( 73),
                                                              ( 74), ( 75),
                                                              ( 76), ( 77),
                                                              ( 78), ( 79),
                                                              ( 80), ( 81),
                                                              ( 82), ( 83),
                                                              ( 84), ( 85),
                                                              ( 86), ( 87),
                                                              ( 88), ( 89),
                                                              ( 90), ( 91),
                                                              ( 92), ( 93),
                                                              ( 94), ( 95),
                                                              ( 96), ( 97),
                                                              ( 98), ( 99),
                                                              ( 100), ( 101),
                                                              ( 102), ( 103),
                                                              ( 104), ( 105),
                                                              ( 106), ( 107),
                                                              ( 108), ( 109),
                                                              ( 110), ( 111),
                                                              ( 112), ( 113),
                                                              ( 114), ( 115),
                                                              ( 116), ( 117),
                                                              ( 118), ( 119),
                                                              ( 120), ( 121),
                                                              ( 122), ( 123),
                                                              ( 124), ( 125),
                                                              ( 126), ( 127),
                                                              ( 128), ( 129),
                                                              ( 130), ( 131),
                                                              ( 132), ( 133),
                                                              ( 134), ( 135),
                                                              ( 136), ( 137),
                                                              ( 138), ( 139),
                                                              ( 140), ( 141),
                                                              ( 142), ( 143),
                                                              ( 144), ( 145),
                                                              ( 146), ( 147),
                                                              ( 148), ( 149),
                                                              ( 150), ( 151),
                                                              ( 152), ( 153),
                                                              ( 154), ( 155),
                                                              ( 156), ( 157),
                                                              ( 158), ( 159),
                                                              ( 160), ( 161),
                                                              ( 162), ( 163),
                                                              ( 164), ( 165),
                                                              ( 166), ( 167),
                                                              ( 168), ( 169),
                                                              ( 170), ( 171),
                                                              ( 172), ( 173),
                                                              ( 174), ( 175),
                                                              ( 176), ( 177),
                                                              ( 178), ( 179),
                                                              ( 180), ( 181),
                                                              ( 182), ( 183),
                                                              ( 184), ( 185),
                                                              ( 186), ( 187),
                                                              ( 188), ( 189),
                                                              ( 190), ( 191),
                                                              ( 192), ( 193),
                                                              ( 194), ( 195),
                                                              ( 196), ( 197),
                                                              ( 198), ( 199),
                                                              ( 200) 
                                                              ) t2 ( n )
                                                              ) t1
                                                        CROSS APPLY ( SELECT
                                                              Letter = SUBSTRING(Reference,
                                                              t1.n, 1)
                                                              ) ref
                                                        OUTER APPLY ( -- tgt
                                                              SELECT TOP ( WordLength )
                                                              n = ABS(t1.n
                                                              - t2.n)
                                                              FROM
                                                              ( VALUES ( 1),
                                                              ( 2), ( 3), ( 4),
                                                              ( 5), ( 6), ( 7),
                                                              ( 8), ( 9),
                                                              ( 10), ( 11),
                                                              ( 12), ( 13),
                                                              ( 14), ( 15),
                                                              ( 16), ( 17),
                                                              ( 18), ( 19),
                                                              ( 20), ( 21),
                                                              ( 22), ( 23),
                                                              ( 24), ( 25),
                                                              ( 26), ( 27),
                                                              ( 28), ( 29),
                                                              ( 30), ( 31),
                                                              ( 32), ( 33),
                                                              ( 34), ( 35),
                                                              ( 36), ( 37),
                                                              ( 38), ( 39),
                                                              ( 40), ( 41),
                                                              ( 42), ( 43),
                                                              ( 44), ( 45),
                                                              ( 46), ( 47),
                                                              ( 48), ( 49),
                                                              ( 50), ( 51),
                                                              ( 52), ( 53),
                                                              ( 54), ( 55),
                                                              ( 56), ( 57),
                                                              ( 58), ( 59),
                                                              ( 60), ( 61),
                                                              ( 62), ( 63),
                                                              ( 64), ( 65),
                                                              ( 66), ( 67),
                                                              ( 68), ( 69),
                                                              ( 70), ( 71),
                                                              ( 72), ( 73),
                                                              ( 74), ( 75),
                                                              ( 76), ( 77),
                                                              ( 78), ( 79),
                                                              ( 80), ( 81),
                                                              ( 82), ( 83),
                                                              ( 84), ( 85),
                                                              ( 86), ( 87),
                                                              ( 88), ( 89),
                                                              ( 90), ( 91),
                                                              ( 92), ( 93),
                                                              ( 94), ( 95),
                                                              ( 96), ( 97),
                                                              ( 98), ( 99),
                                                              ( 100), ( 101),
                                                              ( 102), ( 103),
                                                              ( 104), ( 105),
                                                              ( 106), ( 107),
                                                              ( 108), ( 109),
                                                              ( 110), ( 111),
                                                              ( 112), ( 113),
                                                              ( 114), ( 115),
                                                              ( 116), ( 117),
                                                              ( 118), ( 119),
                                                              ( 120), ( 121),
                                                              ( 122), ( 123),
                                                              ( 124), ( 125),
                                                              ( 126), ( 127),
                                                              ( 128), ( 129),
                                                              ( 130), ( 131),
                                                              ( 132), ( 133),
                                                              ( 134), ( 135),
                                                              ( 136), ( 137),
                                                              ( 138), ( 139),
                                                              ( 140), ( 141),
                                                              ( 142), ( 143),
                                                              ( 144), ( 145),
                                                              ( 146), ( 147),
                                                              ( 148), ( 149),
                                                              ( 150), ( 151),
                                                              ( 152), ( 153),
                                                              ( 154), ( 155),
                                                              ( 156), ( 157),
                                                              ( 158), ( 159),
                                                              ( 160), ( 161),
                                                              ( 162), ( 163),
                                                              ( 164), ( 165),
                                                              ( 166), ( 167),
                                                              ( 168), ( 169),
                                                              ( 170), ( 171),
                                                              ( 172), ( 173),
                                                              ( 174), ( 175),
                                                              ( 176), ( 177),
                                                              ( 178), ( 179),
                                                              ( 180), ( 181),
                                                              ( 182), ( 183),
                                                              ( 184), ( 185),
                                                              ( 186), ( 187),
                                                              ( 188), ( 189),
                                                              ( 190), ( 191),
                                                              ( 192), ( 193),
                                                              ( 194), ( 195),
                                                              ( 196), ( 197),
                                                              ( 198), ( 199),
                                                              ( 200) ) t2 ( n )
                                                              WHERE
                                                              SUBSTRING(@Target,
                                                              t2.n, 1) = ref.Letter
                                                              ) tgt
                                              GROUP BY  t1.n ,
                                                        ref.Letter ,
                                                        v.WordLength
                                            ) do
                                   )
                         END
        RETURN @score
    END

Here is the query to call the function

select [Party].[FuzySearch]('First Name Middle Name Last Name', C.FirstName) from dbo.Customer C

This is taking about 2 minutes 22 seconds to give me the percentage of fuzzy match for all

How can I fix this to run in lessthan a second. Any suggestions on my function to make it more robust.

Expected ouput is 45.34, 40.00, 100.00, 23.00, 81.23.....

HaBo
  • 13,999
  • 36
  • 114
  • 206
  • 2
    Out of curiosity, what algorithm are you implementing here? – billinkc Oct 08 '14 at 15:45
  • SQL Server provides [Fuzzy Lookups](http://msdn.microsoft.com/en-us/library/ms137786.aspx) and [Fuzzy Grouping](http://msdn.microsoft.com/en-us/library/ms141764.aspx) as part of SSIS - which means you can only use them in batch operations. To improve ad-hoc query performance you'd have to write a SQLCLR function implementing the algorithm you want. The string functions you use prevent SQL Server from using indexes. Regex matches in C# would perform better and result in far less code – Panagiotis Kanavos Oct 08 '14 at 15:54
  • You could also cheat and use the [Similarity SQLCLR functions](http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/) in SQL Server's Master Data Services – Panagiotis Kanavos Oct 08 '14 at 15:57
  • 2
    Also see [On-Demand Fuzzy Lookup](http://dba.stackexchange.com/questions/13593/on-demand-fuzzy-lookup) in dba.stackexchange.com for a Fuzzy Lookup SQLCLR implementation, or [Fuzzy Strings Matching using Levenshtein algorithm on SQL Server](http://www.pawlowski.cz/2010/12/sql_server-fuzzy-strings-matching-using-levenshtein-algorithm-t-sql-vs-clr/) for speed comparisons between T-SQL and SQLCLR (30x better in SQLCLR) – Panagiotis Kanavos Oct 08 '14 at 16:00
  • 1
    @billinkc Levenshtein – HaBo Oct 08 '14 at 18:02

2 Answers2

7

The best I have been able to do is simplify some of the query, and change it to a table valued function. Scalar functions are notoriously poor performers, and the benefit of an inline TVF is that the query definition is expanded out into the main query, much like a view.

This reduces the execution time significantly on the tests I have done.

ALTER FUNCTION dbo.FuzySearchTVF (@Reference VARCHAR(200), @Target VARCHAR(200))
RETURNS TABLE
AS
RETURN
(   WITH N (n) AS 
    (   SELECT  TOP (ISNULL(CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target) 
                                    THEN DATALENGTH(@Reference) 
                                ELSE DATALENGTH(@Target) 
                            END, 0))    
                ROW_NUMBER() OVER(ORDER BY n1.n)
        FROM    (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N1 (n)
        CROSS JOIN (VALUES (1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) AS N2 (n)
        CROSS JOIN (VALUES (1), (1)) AS N3 (n)
        WHERE   @Reference IS NOT NULL AND @Target IS NOT NULL
    ), Src AS
    (   SELECT  Reference = CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target) THEN @Reference
                                ELSE @Reference + REPLICATE('_', DATALENGTH(@Target) - DATALENGTH(@Reference))
                            END,
                Target = CASE WHEN DATALENGTH(@Target) > DATALENGTH(@Reference) THEN @Target
                                ELSE @Target + REPLICATE('_', DATALENGTH(@Target) - DATALENGTH(@Reference))
                            END,
                WordLength = CASE WHEN DATALENGTH(@Reference) > DATALENGTH(@Target) THEN DATALENGTH(@Reference) ELSE DATALENGTH(@Target) END
        WHERE   @Reference IS NOT NULL 
        AND     @Target IS NOT NULL
        AND     @Reference != @Target
    ), Scores AS
    (   SELECT  seq = t1.n ,
                Letter = SUBSTRING(s.Reference, t1.n, 1),
                s.WordLength ,
                LetterScore = s.WordLength - ISNULL(MIN(ABS(t1.n - t2.n)), s.WordLength)
        FROM    Src AS s
                CROSS JOIN N AS t1
                INNER JOIN N AS t2
                    ON SUBSTRING(@Target, t2.n, 1) = SUBSTRING(s.Reference, t1.n, 1)
        WHERE   @Reference IS NOT NULL 
        AND     @Target IS NOT NULL
        AND     @Reference != @Target
        GROUP BY t1.n, SUBSTRING(s.Reference, t1.n, 1), s.WordLength
    )
    SELECT  [Score] = 100 
    WHERE   @Reference = @Target
    UNION ALL
    SELECT  0
    WHERE   @Reference IS NULL OR @Target IS NULL
    UNION ALL
    SELECT  CAST(SUM(LetterScore) * 100.0 / MAX(WordLength * WordLength) AS NUMERIC(5, 2))
    FROM    Scores
    WHERE   @Reference IS NOT NULL 
    AND     @Target IS NOT NULL
    AND     @Reference != @Target
    GROUP BY WordLength
);

And this would be called as:

SELECT  f.Score
FROM    dbo.Customer AS c
        CROSS APPLY [dbo].[FuzySearch]('First Name Middle Name Last Name', c.FirstName) AS f

It is still a fairly complex function though, and, depending on the number of records in your customer table, I think getting it down to 1 second is going to be a bit of a challenge.

GarethD
  • 68,045
  • 10
  • 83
  • 123
  • Thank you this definetly improved teh performance. My SVF was taking about 1 min 15 seconds. Your TVF is taking 8 seconds. But Still looking to find if I can make it in lessthan a second – HaBo Oct 08 '14 at 18:28
  • Is there anyalternate for Cross Apply. This is killing the performance. – HaBo Oct 09 '14 at 17:41
  • Not to call the function. Realistically I think splitting 40000 strings is the real performance killer, and I can't see any possible way of getting this down to less than a second. I'd suggest looking for a CLR solution, as this is likely to offer the best performance. There is an example CLR in [this article](http://www.pawlowski.cz/2010/12/sql_server-fuzzy-strings-matching-using-levenshtein-algorithm-t-sql-vs-clr/). – GarethD Oct 10 '14 at 06:49
  • Thank you for the good start. CLR solution worked perfectly. it is now executing in ~1 second – HaBo Oct 15 '14 at 18:07
  • Thanks you so much for your fuzzy search – cat916 Apr 19 '18 at 05:35
4

This is how I could accomplish this:

Explained further @ SQL Server Fuzzy Search - Levenshtein Algorithm

Create below file using any editor of your choice:

using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredFunctions
{

    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString stringOne, SqlString stringTwo)
    {
        #region Handle for Null value

        if (stringOne.IsNull)
            stringOne = new SqlString("");

        if (stringTwo.IsNull)
            stringTwo = new SqlString("");

        #endregion

        #region Convert to Uppercase

        string strOneUppercase = stringOne.Value.ToUpper();
        string strTwoUppercase = stringTwo.Value.ToUpper();

        #endregion

        #region Quick Check and quick match score

        int strOneLength = strOneUppercase.Length;
        int strTwoLength = strTwoUppercase.Length;

        int[,] dimention = new int[strOneLength + 1, strTwoLength + 1];
        int matchCost = 0;

        if (strOneLength + strTwoLength == 0)
        {
            return 100;
        }
        else if (strOneLength == 0)
        {
            return 0;
        }
        else if (strTwoLength == 0)
        {
            return 0;
        }

        #endregion

        #region Levenshtein Formula

        for (int i = 0; i <= strOneLength; i++)
            dimention[i, 0] = i;

        for (int j = 0; j <= strTwoLength; j++)
            dimention[0, j] = j;

        for (int i = 1; i <= strOneLength; i++)
        {
            for (int j = 1; j <= strTwoLength; j++)
            {
                if (strOneUppercase[i - 1] == strTwoUppercase[j - 1])
                    matchCost = 0;
                else
                    matchCost = 1;

                dimention[i, j] = System.Math.Min(System.Math.Min(dimention[i - 1, j] + 1, dimention[i, j - 1] + 1), dimention[i - 1, j - 1] + matchCost);
            }
        } 

        #endregion

        // Calculate Percentage of match
        double percentage = System.Math.Round((1.0 - ((double)dimention[strOneLength, strTwoLength] / (double)System.Math.Max(strOneLength, strTwoLength))) * 100.0, 2);

        return percentage;
    }
};

Name it levenshtein.cs

Go to Command Prompt. Go to the file directory of levenshtein.cs then call csc.exe /t: library /out: UserFunctions.dll levenshtein.cs you may have to give the full path of csc.exe from NETFrameWork 2.0.

Once your DLL is ready. Add it to the assemblies Database>>Programmability>>Assemblies>> New Assembly.

Create function in your database:

CREATE FUNCTION dbo.LevenshteinSVF
    (
      @S1 NVARCHAR(200) ,
      @S2 NVARCHAR(200)
    )
RETURNS FLOAT
AS EXTERNAL NAME
    UserFunctions.StoredFunctions.Levenshtein
GO

In my case I had to enable clr:

sp_configure 'clr enabled', 1
GO
reconfigure
GO

Test the function:

SELECT  dbo.LevenshteinSVF('James','James Bond')

Result: 50 % match

Bugs
  • 4,491
  • 9
  • 32
  • 41
HaBo
  • 13,999
  • 36
  • 114
  • 206
  • I am unable to add the assembly to SQL Server 2008 R2... Error reads: "Create failed for SqlAssembly 'UserFunctions'. (Microsoft.SqlServer.Smo) | CREATE ASSEMBLY for assembly 'UserFunctions' failed because the assembly is built for an unsupported version of the Common Language Runtime. (Microsoft SQL Server, Erro: 6257)". I'm running .NET 4.6? – Derek Foulk Oct 26 '15 at 17:58
  • you need to match the .NET framework version. Check what is the .NET version on SQL Server 2008 R2 and compile your assembly with that version. – HaBo Nov 05 '15 at 01:11
  • 1
    From the root of C drive: Open command prompt and entered: `C:\>C:\Windows\Microsoft.NET\Framework\v2.0.50727\csc.exe /t:library /out:UserFunctions.dll Levenshtein.cs` (NOTE: my Levenshtein.cs file was pasted at the root of C drive before running this command). Then I ran the following query in Server Management Studio: `USE MyDatabase GO CREATE ASSEMBLY UserFunctions from 'c:\UserFunctions.dll' WITH PERMISSION_SET = SAFE`. This is what I had to do to get this to work on my SQL Server 2008 R2 box... Thanks @HaBo! You've helped me big time! – Derek Foulk Nov 13 '15 at 20:20
  • Now I am seeing "Could not find method 'Levenshtein' for type 'StoredFunctions' in assembly 'UserFunctions'" when I try to create the dbo.Levenshtein function as you instructed... I made sure to try enabling clr (it was already enabled). Any ideas? – Derek Foulk Nov 13 '15 at 20:24
  • 1
    The CLR function is called HaBoLevenshtein. I think there was a typo in the original answer. Rename Levenshtein to HaBoLevenshtein in the SQL function and it should work. – Bob Black Jun 01 '17 at 02:48
  • What exactly this function returns? What does this percentage mean? – Przemyslaw Remin Aug 14 '18 at 09:14
  • @PrzemyslawRemin it returns pronunciation match in percentage – HaBo Aug 14 '18 at 09:57
  • @HaBo How do you count that percentage? What is in the numerator and what in the denominator of the fraction? Number of character deletes, inserts? I would like to know precise formula. – Przemyslaw Remin Aug 14 '18 at 10:47
  • @PrzemyslawRemin you need to read the code to understand it. Some references https://learn.microsoft.com/en-us/sql/t-sql/functions/soundex-transact-sql?view=sql-server-2017 https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017 – HaBo Aug 14 '18 at 13:36