5

In my application i need to identify a person by searching for their lastname and firstname. One requirement is to accept spelling errors to a certain degree.

My attempts to identify a person given a firstname and lastname were:

  1. sql query using soundex
  2. sql query using levenshtein-distance (LD) which was calculated with this LD-function

The screenshot contains some test records and the result of my sql-query which includes the soundex-value for each column and the LD

compare records using levenshtein distance

My current query looks like this

SELECT t2.*
        , t1.Firstname + ' ' + t1.Lastname as SourceName
        , 'Torsten Mueller' as TargetName
        , dbo.FUNC_LEVENSHTEIN(t1.Firstname +' '+ t1.Lastname
                            , 'Torsten Mueller', 8) as LEVENSHTEIN_Distance     
 FROM #TestSoundex t1
 LEFT JOIN #TestSoundex t2 ON t1.Id = t2.Id
 WHERE t1.Soundex_Firstname = SOUNDEX('Torsten')
       AND t1.Soundex_Lastname = SOUNDEX('Mueller')

As you can see i filter the result by soundex first and calculate the levenshtein-distance for the remaining records. In this sample below the levenshtein-distance ranges from 0 (both strings are equal) to 3.

SourceName       | TargetName      | Levenshtein Distance 
Thorsten Müller  | Torsten Mueller |  3 
Torsten Müller   | Torsten Mueller |  2
Thorsten Mueller | Torsten Mueller |  1
Torsten Mueller  | Torsten Mueller |  0

In this talk from a stanford professor the calculation for the distance is explained:

I N T E * N TION 
| | | | | | | 
* E X E C U TION
d s s   i s

Each deletion d, insertion i adds 1 point, a substitution s adds 2 points. The LD-function i use returns 5 points for the sample above, but only 3 instead of 4 for the distance between Thorsten Müller and Torsten Mueller. I

+1 point to delete h, 
+1 point instead of 2 to substitute ü 
+1 point to insert e

So i added some samples

Samples with Umlaut

Questions

My impression is that neither soundex nor LD are sufficient to uniquely identify a person-record given firstname and lastname and taking into account that there might be spelling mismatches.

  • Can you explain how this LD-Function handles Umlaute ü,ö,äso i can better understand the calculation?
  • What would you recommend as a max value for the distance to find the right match for firstname and lastname given the string s and t, should it be based on the length of both strings numberOrCharacters(s+t)/2 = max?

Source Code

This is the function i use from the linked answer. I only changed the name of the function from edit_distance_within to FUNC_LEVENSHTEIN

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO

CREATE FUNCTION FUNC_LEVENSHTEIN(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = '', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = '', @j = 1, @cmin = 4000
    WHILE @j <= @tl
    BEGIN
      SET @c = @c + 1
      SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END
      IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c
      SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1
    END
    IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
GO

Source to test function above

This is another test

CREATE TABLE #TestLevenshteinDistance(
    Id int  IDENTITY(1,1) NOT NULL,
    SourceName nvarchar(100) NULL,  
    Soundex_SourceName varchar(4) NULL,    
    Targetname nvarchar(100) NULL, 
    Soundex_TargetName varchar(4) NULL, 
    );      

INSERT INTO #TestLevenshteinDistance 
    (    SourceName,          
         Soundex_SourceName,
         Targetname,
         Soundex_TargetName) 
VALUES 
   ('Intention',SOUNDEX('Intention'), 'Execution', SOUNDEX('Execution')),    
   ('Karsten' , SOUNDEX('Karsten'), 'Torsten', SOUNDEX('Torsten')); 


SELECT t1.*
        , dbo.FUNC_LEVENSHTEIN(t1.SourceName, t1.Targetname, 8) as LEVENSHTEIN_Distance
        FROM #TestLevenshteinDistance t1
surfmuggle
  • 5,527
  • 7
  • 48
  • 77

0 Answers0