19

Let's say I have a set of 2 words:

Alexander and Alecsander OR Alexander and Alegzander

Alexander and Aleaxnder, or any other combination. In general we are talking about human error in typing of a word or a set of words.

What I want to achieve is to get the percentage of matching of the characters of the 2 strings.

Here is what I have so far:

    DECLARE @table1 TABLE
(
  nr INT
  , ch CHAR
)

DECLARE @table2 TABLE
(
  nr INT
  , ch CHAR
)


INSERT INTO @table1
SELECT nr,ch FROM  [dbo].[SplitStringIntoCharacters] ('WORD w') --> return a table of characters(spaces included)

INSERT INTO @table2
SELECT nr,ch FROM  [dbo].[SplitStringIntoCharacters] ('WORD 5')

DECLARE @resultsTable TABLE
( 
 ch1 CHAR
 , ch2 CHAR
)
INSERT INTO @resultsTable
SELECT DISTINCt t1.ch ch1, t2.ch ch2 FROM @table1 t1
FULL JOIN @table2 t2 ON  t1.ch = t2.ch  --> returns both matches and missmatches

SELECT * FROM @resultsTable
DECLARE @nrOfMathches INT, @nrOfMismatches INT, @nrOfRowsInResultsTable INT
SELECT  @nrOfMathches = COUNT(1) FROM  @resultsTable WHERE ch1 IS NOT NULL AND ch2 IS NOT NULL
SELECT @nrOfMismatches = COUNT(1) FROM  @resultsTable WHERE ch1 IS NULL OR ch2 IS NULL


SELECT @nrOfRowsInResultsTable = COUNT(1)  FROM @resultsTable


SELECT @nrOfMathches * 100 / @nrOfRowsInResultsTable

The SELECT * FROM @resultsTable will return the following:

ch1         ch2
NULL        5
[blank]     [blank] 
D           D
O           O
R           R
W           W
easwee
  • 15,757
  • 24
  • 60
  • 83
Dragos Durlut
  • 8,018
  • 10
  • 47
  • 62

2 Answers2

35

Ok, here is my solution so far:

SELECT  [dbo].[GetPercentageOfTwoStringMatching]('valentin123456'  ,'valnetin123456')

returns 86%

CREATE FUNCTION [dbo].[GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(100)
    ,@string2 NVARCHAR(100)
)
RETURNS INT
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1)
    , @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[LEVENSHTEIN] (   @string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters

END




-- =============================================     
-- Create date: 2011.12.14
-- Description: http://blog.sendreallybigfiles.com/2009/06/improved-t-sql-levenshtein-distance.html
-- =============================================

CREATE FUNCTION [dbo].[LEVENSHTEIN](@left  VARCHAR(100),
                                    @right VARCHAR(100))
returns INT
AS
  BEGIN
      DECLARE @difference    INT,
              @lenRight      INT,
              @lenLeft       INT,
              @leftIndex     INT,
              @rightIndex    INT,
              @left_char     CHAR(1),
              @right_char    CHAR(1),
              @compareLength INT

      SET @lenLeft = LEN(@left)
      SET @lenRight = LEN(@right)
      SET @difference = 0

      IF @lenLeft = 0
        BEGIN
            SET @difference = @lenRight

            GOTO done
        END

      IF @lenRight = 0
        BEGIN
            SET @difference = @lenLeft

            GOTO done
        END

      GOTO comparison

      COMPARISON:

      IF ( @lenLeft >= @lenRight )
        SET @compareLength = @lenLeft
      ELSE
        SET @compareLength = @lenRight

      SET @rightIndex = 1
      SET @leftIndex = 1

      WHILE @leftIndex <= @compareLength
        BEGIN
            SET @left_char = substring(@left, @leftIndex, 1)
            SET @right_char = substring(@right, @rightIndex, 1)

            IF @left_char <> @right_char
              BEGIN -- Would an insertion make them re-align?
                  IF( @left_char = substring(@right, @rightIndex + 1, 1) )
                    SET @rightIndex = @rightIndex + 1
                  -- Would an deletion make them re-align?
                  ELSE IF( substring(@left, @leftIndex + 1, 1) = @right_char )
                    SET @leftIndex = @leftIndex + 1

                  SET @difference = @difference + 1
              END

            SET @leftIndex = @leftIndex + 1
            SET @rightIndex = @rightIndex + 1
        END

      GOTO done

      DONE:

      RETURN @difference
  END 
Dragos Durlut
  • 8,018
  • 10
  • 47
  • 62
  • so you posted a question for nothing ^^ – aF. Dec 15 '11 at 11:25
  • 24
    @aF. No, I did not post a question for nothing. I posted a question, then I continued to search for a solution to my problem. I found something of use and I posted it here so maybe someone with more knowledge than me can tell me if there is a better or a more precise way to acomplish this. Also, in the future maybe someone will benefit from this. I have had other similar situations(http://stackoverflow.com/questions/3107514/html-agility-pack-strip-tags-not-in-whitelist) where my answers have helped others. – Dragos Durlut Dec 15 '11 at 12:04
  • 2
    +1 for crediting the source of your Levenshtein distance algorithm in the code comments. Classy. – Jonathan Van Matre Dec 19 '11 at 16:35
  • @JonathanVM Well, he explained it better than I could. – Dragos Durlut Dec 20 '11 at 09:36
  • 1
    new link for the blog post http://randomrumenations.blogspot.com.br/2009/06/improved-t-sql-levenshtein-distance.html – Pascal Dec 15 '16 at 12:17
  • 1
    This algorithm is horrible. It considers the following 2 strings as "18%" similar: "129 W MCKNIGHT WAY" <---> "SPD". Seriously? There is literally nothing similar about those two strings. The address contains no "S", no "P", and no "D". They are different lengths. They should be listed as 0%. – alexGIS Oct 19 '17 at 23:25
  • 1
    Perfect... I switched from Oracle to SQL Server and I am surprised at the lack of easy built in functions that perform complex calculations in SQL Server. If you searched for the SQL Server equivalent to Oracles UTL_MATCH.edit_distance_similarity(col1, col2) function then you found the appropriate answer. – Code Novice Jul 20 '20 at 15:22
14

Ultimately, you appear to be looking to solve for the likelihood that two strings are a "fuzzy" match to one another.

SQL provides efficient, optimized built-in functions that will do that for you, and likely with better performance than what you have written. The two functions you are looking for are SOUNDEX and DIFFERENCE.

While neither of them solves exactly what you asked for - i.e. they do not return a percentage match - I believe they solve what you are ultimately trying to achieve.

SOUNDEX returns a 4-character code which is the first letter of the word plus a 3-number code that represents the sound pattern of the word. Consider the following:

SELECT SOUNDEX('Alexander')
SELECT SOUNDEX('Alegzander')
SELECT SOUNDEX('Owleksanndurr')
SELECT SOUNDEX('Ulikkksonnnderrr')
SELECT SOUNDEX('Jones')

/* Results:

A425
A425
O425
U425
J520

*/

What you will notice is that the three-digit number 425 is the same for all of the ones that roughly sound alike. So you could easily match them up and say "You typed 'Owleksanndurr', did you perhaps mean 'Alexander'?"

In addition, there's the DIFFERENCE function, which compares the SOUNDEX discrepancy between two strings and gives it a score.

SELECT DIFFERENCE(  'Alexander','Alexsander')
SELECT DIFFERENCE(  'Alexander','Owleksanndurr')
SELECT DIFFERENCE(  'Alexander', 'Jones')
SELECT DIFFERENCE(  'Alexander','ekdfgaskfalsdfkljasdfl;jl;asdj;a')

/* Results:

4
3
1
1     

*/

As you can see, the lower the score (between 0 and 4), the more likely the strings are a match.

The advantage of SOUNDEX over DIFFERENCE is that if you really need to do frequent fuzzy matching, you can store and index the SOUNDEX data in a separate (indexable) column, whereas DIFFERENCE can only calculate the SOUNDEX at the time of comparison.

andrews
  • 2,173
  • 2
  • 16
  • 29
  • +1 Thank you. I will take into consideration your response. By looks of it, I can only take into consideration the results that have "1". – Dragos Durlut Dec 19 '11 at 08:37
  • An exact match will return zero. I didn't show that in the example, but it's important to know so you don't start doing WHERE DIFFERENCE(...) = 1 and miss all the perfect matches. :) – Jonathan Van Matre Dec 19 '11 at 13:52
  • 5
    Actually, 0 indicates weak or no similarity and 4 indicates strong similarity. http://msdn.microsoft.com/en-us/library/ms188753.aspx – Auresco82 Dec 06 '12 at 13:49
  • 1
    Also Difference and Soundex only calculate the difference of the first word (SQL 2008 R2). So for a phrase you need the Levenshtein distance plus some fuzzy logic with string lengths. I came to a similar conclusion as above but was wondering about whether to use the shortest string's length, the longer string's length, or add them and divide by the difference. – James Murgolo Oct 08 '14 at 15:54
  • The results you list are upside down. I put an edit suggestion in. – Andy Raddatz Jun 11 '15 at 18:30
  • *This* is what I was looking for when I googled this question. – MGOwen Feb 07 '18 at 23:41
  • DIFFERENCE() is not a good test. Consider DIFFERENCE('MMG','MMG SDFSDF SDFSDF FFDF'). This still returns 4! – Fandango68 Sep 03 '21 at 00:03