
I have a t-sql query that looks like this:

select * from (
  SELECT [Id],
         replace(ca.[AKey], '-', '') as [AKey1],
         rtrim(replace(replace(replace(lower([Name]), '#', ''), '(1.0)', ''), '(2.5)', '')) as [Name],
         [Key],
         dw.[AKey],
         replace(lower(trim([wName])), '#', '') as [wName]
  FROM [dbo].[wTable] ca
  FULL JOIN [dw].[wTable] dw
    ON rtrim(left(replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', ''), 5)) + '%'
       like
       rtrim(left(replace(replace(replace(lower(ca.[Name]), '(1.0)', ''), '(2.5)', ''), '#', ''), 5)) + '%'
   AND right(rtrim(replace(replace(replace(lower(dw.[wName]), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
       like
       right(rtrim(replace(replace(replace(lower(ca.[Name]), '(1.0)', ''), '(2.5)', ''), '#', '')), 2)
) tp

As you can see, the JOIN first strips out some fuzzy characters that may or may not exist, then checks whether the first 5 characters of the wName column match the first 5 characters of the Name column, and does the same for the last 2 characters of each column.

So essentially, it's matching on the first 5 characters AND last 2 characters.

What I'm trying to add is an additional column that will tell me if the resulting columns are an exact match or if they are fuzzy. In other words, if they are an exact match it should say 'True' or something like that, and if they are a fuzzy match I would ideally like it to tell me how far off they are. For example, how many characters do not match.
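To make the rule concrete, here is the normalization and matching logic, plus the kind of flag/difference column I'm after, sketched in Python (illustrative only; the real work happens in the T-SQL above, and the "how far off" metric here is just a crude position-by-position count):

```python
def normalize(s):
    # mimic the nested LOWER/REPLACE/RTRIM calls in the query
    s = s.lower()
    for junk in ('(1.0)', '(2.5)', '#'):
        s = s.replace(junk, '')
    return s.rstrip()

def is_candidate(name, wname):
    # the JOIN condition: first 5 characters AND last 2 characters match
    a, b = normalize(name), normalize(wname)
    return a[:5] == b[:5] and a[-2:] == b[-2:]

def match_info(name, wname):
    # the extra column I want: exact-match flag plus a rough
    # count of how many characters do not line up
    a, b = normalize(name), normalize(wname)
    exact = (a == b)
    diff = sum(c1 != c2 for c1, c2 in zip(a, b)) + abs(len(a) - len(b))
    return exact, diff
```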

SUMguy
  • Consider implementing the Levenshtein distance function in TSQL [Like this answer](https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql). You can then compare two fields, get back the Levenshtein distance and make decisions based off of that. – JNevill Nov 29 '17 at 16:23

1 Answer


As JNevill mentioned, you could use Levenshtein distance. You could also use Damerau-Levenshtein or the Longest Common Substring, depending on how accurate you want to get and what your performance expectations are.
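For reference, this is what Levenshtein distance computes: the minimum number of single-character inserts, deletes, and substitutions needed to turn one string into the other. A standard dynamic-programming version in Python (not the T-SQL function used below, just the algorithm it implements):

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t."""
    # prev[j] = distance between s[:i-1] and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete from s
                           cur[j - 1] + 1,              # insert into s
                           prev[j - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]
```

A distance of 0 means the strings are an exact match, so the same value doubles as the "exact or fuzzy" flag the question asks for.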

Below are two solutions. The first uses a Levenshtein implementation I grabbed from Phil Factor here. The second uses my version of the Longest Common Substring, which is by far the fastest available for SQL Server.

-- sample data
declare @t1 table (string1 varchar(100));
declare @t2 table (string2 varchar(100));
insert @t1 values ('abc'),('xxyz'),('1234'),('9923');
insert @t2 values ('abcd'),('xyz'),('2345'),('zzz');

-- Levenshtein    
select string1, string2, Ld 
from
(
  select *, Ld = dbo.LEVENSHTEIN(t1.string1, t2.string2)
  from @t1 t1
  cross join @t2 t2 
) compare 
where ld <= 2;

-- Longest Common Substring    
select string1, string2, lcss = item, lcssLen = itemlen, diff = mx.L-itemLen
from @t1 t1
cross join @t2 t2 
cross apply dbo.lcssWindowAB(t1.string1, t2.string2, 20)
cross apply (values (IIF(len(string1) > len(string2), len(string1),len(string2)))) mx(L)
where mx.L-itemLen <= 2;

RESULTS

string1  string2  Ld   
-------- -------- -----
abc      abcd     1
xxyz     xyz      1
1234     2345     2

string1  string2  lcss  lcssLen     diff
-------- -------- ----- ----------- -----------
abc      abcd     abc   3           1
xxyz     xyz      xyz   3           1
1234     2345     234   3           1
9923     2345     23    2           2
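The diff column in the second result set is just the length of the longer input minus the length of the longest common substring. Sketched in Python (illustrative; the lcssWindowAB T-SQL function itself is not reproduced here):

```python
def longest_common_substring(s, t):
    """Longest run of characters appearing contiguously in both strings."""
    best = ''
    # prev[j] = length of the common suffix of s[:i-1] and t[:j]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > len(best):
                    best = s[i - cur[j]:i]
        prev = cur
    return best

def lcss_diff(s, t):
    # the "diff" metric from the query: longer length minus LCSS length
    return max(len(s), len(t)) - len(longest_common_substring(s, t))
```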

This does not completely answer your question, but it should get you started.

P.S. Note that the pair ("9923", "2345") appears only in the Longest Common Substring results: its Levenshtein distance is 4 (e.g., four substitutions), so the `ld <= 2` filter excludes it, while the LCSS diff is only 2. There are other Levenshtein functions out there if Phil Factor's version doesn't suit you.

Alan Burstein