1

How can I convert the result of mdq.Similarity to the number of edits needed for the two words to match. This function is part of Master Data Service (MDS) in Microsoft SQL Server defined as:

USE [mds]
ALTER FUNCTION [mdq].[Similarity](@input1 [nvarchar](4000), @input2 [nvarchar](4000), @method [tinyint], @containmentBias [float], @minScoreHint [float])
RETURNS [float] WITH EXECUTE AS CALLER, RETURNS NULL ON NULL INPUT
AS EXTERNAL NAME [Microsoft.MasterDataServices.DataQuality].[Microsoft.MasterDataServices.DataQuality.SqlClr].[Similarity]

The two words that are 1 edit away from each other produce different Levenshtein distance, which seems to account for their length (number of characters in the word).

SELECT a=mds.mdq.Similarity('a','',0,0,0), 
ab=mds.mdq.Similarity('ab','a',0,0,0), 
abc=mds.mdq.Similarity('abc','ab',0,0,0), 
ac=mds.mdq.Similarity('ac','ab',0,0,0)

a   ab  abc     ac
0   0.5 0.67    0.5

Whereas I need it to return 1 in each case because each pair has two words differing by a single edit (insertion, deletion, substitution).

Oleg Melnikov
  • 3,080
  • 3
  • 34
  • 65

2 Answers2

2

You can't.

I haven't been able to do anything that useful with this function. The main reason I'm posting this is so that people can access the msdn documentation as it does not come up when doing a google search.

Per msdn, mdq.Similarity:

Returns a similarity score. This score indicates the similarity between two strings that are compared by a specified match algorithm.

I understand Levenshtein, Jaro, LCSS but, after playing around with this function for many hours, it's hard to do anything useful with it. mdq.Similarity is a dead end IMO.

This

--Using Levenshtein edit distance algorithm.
SELECT Lev = mdq.Similarity(N'Alexia Geogio', N'Alexandra George', 0, 0.0, 0.0);
--Using Jaccard similarity coefficient algorithm.
SELECT Jacc = mdq.Similarity(N'Alexia Geogio', N'Alexandra George', 1, 0.0, 0.0);
--Using Jaro-Winkler distance algorithm.
SELECT jaro = mdq.Similarity(N'Alexia Geogio', N'Alexandra George', 2, 0.0, 0.0);
--Using longest common subsequence algorithm.
SELECT lcss = mdq.Similarity(N'12345', N'93459', 3, 0.0, 0.0);

Returns

Lev
----------------------
0.5625

Jacc
----------------------
0.269230769230769

jaro
----------------------
0.878846153846154

lcss
----------------------
0.6
Alan Burstein
  • 7,770
  • 1
  • 15
  • 18
  • 1
    Hi Alan. Thanks for the link. Indeed, I failed to find a documentation via Bing, Google, and Yahoo. Isn't there a precise formula on how the Levenshtein is computed in `Similarity`. I could then back out the original edit distance I need. Or, may be there is an author's contact at Microsoft who can bring clarity into what is actually being computed. – Oleg Melnikov Nov 03 '17 at 04:38
2

Inspired by Alan's answer, I digged around for normalized Levenshtein distance. Found this answer and, apparently, it works swell: denormalized mdq.Similarity outputs number of edit operations. So, MDS function computes

1- distance(a,b)/max(a.length, b.length)

So, we reverse engineer the original Levenshtein distance as (note the rounding!!):

CREATE FUNCTION fnLevDist(@a VARCHAR(100), @b VARCHAR(100), @minScoreHint FLOAT=0) 
RETURNS INT AS 
BEGIN
    DECLARE @scaler REAL = CASE WHEN LEN(@a)>LEN(@b) THEN LEN(@a) ELSE LEN(@b) END
    RETURN ROUND((1.0 - mds.mdq.Similarity(@a, @b, 0, 0, @minScoreHint)) * @scaler, 0)
END
GO

WITH txt AS(
SELECT a='a', b='' 
UNION ALL SELECT a='ab', b='a' 
UNION ALL SELECT a='abc', b='ab' 
UNION ALL SELECT a='ac', b='ab' )
SELECT *, Lev=dbo.fnLevDist(a,b,0), Lev_nmzd=mds.mdq.Similarity(a,b, 0, 0, 0) FROM txt

I've retained the @minScoreHint parameter in the wrapping function because it can greatly improve performance (see doc). The output is then:

a   b   Lev Lev_nmzd
a       1   0
ab  a   1   0.5
abc ab  1   0.667
ac  ab  1   0.5

So, it's not so bad afterall. Still, I wish that Microsoft has cited the publication they build their machine learning work on. It is already a standard practice for package documentation in Python, R, and other software (GPL or otherwise).

Oleg Melnikov
  • 3,080
  • 3
  • 34
  • 65
  • I've been meaning to play around with this but haven't had a chance yet. Nonetheless, what you did looks absolutely brilliant Oleg. – Alan Burstein Nov 06 '17 at 04:45