create table tbl1
(
    name varchar(50)
);

insert into tbl1 values ('Mircrosoft SQL Server'),
                        ('Office Microsoft');

create table tbl2
(
    name varchar(50)
);

insert into tbl2 values ('SQL Server Microsoft'),
                        ('Microsoft Office');

I want to get the percentage of matching words between the `name` columns of the two tables.

I tried the LEVENSHTEIN algorithm, but in the given data the values are the same between the tables, just with the words in a different order, so I want the output to show a 100% match.

Tried: LEVENSHTEIN

SELECT [dbo].[GetPercentageOfTwoStringMatching](a.name, b.name) AS MatchedPercentage
      ,a.name AS tbl1_name
      ,b.name AS tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name, b.name) >= 0;

Result:

MatchedPercentage   tbl1_name               tbl2_name
-----------------------------------------------------------------
5                   Mircrosoft SQL Server   SQL Server Microsoft
10                  Office Microsoft        SQL Server Microsoft
15                  Mircrosoft SQL Server   Microsoft Office
13                  Office Microsoft        Microsoft Office
  • First thing you need to do is define your desired algorithm. Does 'data is the same between the tables but with different sequence' refer to chars or words? – Ben Jan 15 '19 at 04:06
  • @Ben, It's about the words. – MAK Jan 15 '19 at 04:07
  • @Squirrel That is very shortsighted and lazy advice. The `GetPercentageOfTwoStringMatching` is hugely over-engineered (not to mention ambiguous to any future maintainers) to simply find the matching words in two strings. – iamdave Jan 15 '19 at 14:24
  • @iamdave. Yes. it is a comment not an answer – Squirrel Jan 16 '19 at 01:46
  • @Squirrel It is still bad advice and therefore shouldn't be given in the first place. Especially if you *know* it is bad advice, as the OP may not have the experience to know better. – iamdave Jan 16 '19 at 11:35
  • @iamdave. Noted. I will remove it, as it is very bad advice – Squirrel Jan 17 '19 at 01:54

1 Answer


As mentioned in the comments this can be achieved through the use of a string split table valued function. Personally I use one based on the very performant set-based tally table approach put together by Jeff Moden which is at the end of my answer.

Using this function allows you to compare the individual words as delimited by a space character and count up the number of matches compared to the total number of words in the two values.

Do note, however, that this solution falls over on values with leading (or multiple consecutive) spaces, which skew the word count. If this will be a problem, clean your data before running this script or adjust the function to handle them:

declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));

insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values');    -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');

with c as
(
    select t1.v as v1
            ,t2.v as v2
            ,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords  -- String Length - String Length without spaces = Number of words - 1
    from @t1 as t1
        cross join @t2 as t2    -- Cross join the two tables to get all comparisons
    where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
        ,c.v2
        ,c.NumWords
        ,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
    cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
    cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
        ,c.v2
        ,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);

Output

+----------------------+----------------------+----------+--------------+
|          v1          |          v2          | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft |        3 |            3 |
| Office Microsoft     | Microsoft Office     |        2 |            2 |
+----------------------+----------------------+----------+--------------+
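If an actual percentage is wanted (as in the original `GetPercentageOfTwoStringMatching` output) rather than exact matches only, the `having` clause and the length filter can be dropped and a ratio computed instead. A sketch, reusing the same `@t1`/`@t2` table variables and `dbo.fn_StringSplit4k` from above (run in the same batch); note it uses the word count of `v1` as the denominator, which is an assumption about how the percentage should be defined:

```sql
with c as
(
    select t1.v as v1
          ,t2.v as v2
          ,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords
    from @t1 as t1
        cross join @t2 as t2    -- No length filter here, so partial matches score too
)
select c.v1
      ,c.v2
      -- Integer division truncates, e.g. 1 of 3 words -> 33
      ,sum(case when s1.item = s2.item then 1 else 0 end) * 100 / c.NumWords as MatchedPercentage
from c
    cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
    cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
        ,c.v2
        ,c.NumWords;
```

Be aware this counts duplicated words more than once, the same way the main query does.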

Function

create function dbo.fn_StringSplit4k
(
     @str nvarchar(4000) = ' '              -- String to split.
    ,@delimiter as nvarchar(1) = ','        -- Delimiting value to split on.
    ,@num as int = null                     -- Which value to return.
)
returns table
as
return
                    -- Start tally table with 10 rows.
    with n(n)   as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)

                    -- Select the same number of rows as characters in @str as incremental row numbers.
                    -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
        ,t(t)   as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)

                    -- Return the position of every value that follows the specified delimiter.
        ,s(s)   as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)

                    -- Return the start and length of every value, to use in the SUBSTRING function.
                    -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
        ,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)

    select rn
          ,item
    from(select row_number() over(order by s) as rn
                ,substring(@str,s,l) as item
        from l
        ) a
    where rn = @num
        or @num is null;
iamdave
  • 12,023
  • 3
  • 24
  • 53
  • I have come across the same scenario for large tables; when I say large, `table1` has 10 million records and `table2` has 20 million. What will be the optimal solution? I tried the above query but it's been more than half an hour and the query is still running. – MAK Jan 22 '19 at 06:41
  • @MAK Optimal depends entirely on your hardware and setup, and I'm afraid is for you to test and come up with based on smaller sample sets of data. What you are trying to do is complex and therefore resource intensive at the best of times. If this is a one-off exercise you could prepare your data by doing some of the intermediate steps, persisting the data within tables and then indexing them ready for the next step. If this is to be a business-as-usual function, I would recommend trying to fix your issues at source. – iamdave Jan 23 '19 at 08:09
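To illustrate the "persist and index" suggestion in the comment above: one approach is to split each table's words once into a persisted table and index it, so the expensive `cross apply` runs a single time per source table instead of once per row pair. A sketch under assumed names (`WordsTbl1` is hypothetical; `tbl1(name)` is from the question):

```sql
-- Persist every word of every row once, with the space-stripped length for early pruning.
select t.name
      ,s.rn
      ,s.item
      ,len(replace(t.name,' ','')) as LenNoSpaces
into WordsTbl1
from tbl1 as t
    cross apply dbo.fn_StringSplit4k(t.name,' ',null) as s;

create clustered index ix_WordsTbl1 on WordsTbl1 (item, LenNoSpaces);

-- Repeat for tbl2 into WordsTbl2, then join WordsTbl1 to WordsTbl2
-- on item (and LenNoSpaces) instead of splitting strings on the fly.
```

Whether this helps depends on hardware and data distribution, as noted above; test on a smaller sample first.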