
I am writing a custom fuzzy matching algorithm that we need in one of our internal applications, and I am trying to speed it up. When I do a CROSS APPLY against a fuzzy function to find suggested matches, I do not want to search unnecessary data.

Here is the function:

select top 5 Manufacturer, ManufacturerPartNumber, Description as ManufacturerDescription, CONVERT(money, Price) as Price, fms.Score
from Products_OurProducts_Products_View
    CROSS APPLY (
         select dbo.FuzzyControlMatch(@objectModel, ManufacturerPartNumber) AS score
    ) AS fms
ORDER BY fms.Score DESC

Now say the manufacturer part number sent by the user is FD1234; we do not really need to fuzzy-match against ALL ManufacturerPartNumbers. Can I add a clause like the one below to get a more precise data set to fuzzy-match, or will this happen AFTER the CROSS APPLY has already run and only affect the final result?

select top 5 Manufacturer, ManufacturerPartNumber, Description as ManufacturerDescription, CONVERT(money, Price) as Price, fms.Score
from Products_OurProducts_Products_View
    CROSS APPLY (
         select dbo.FuzzyControlMatch(@objectModel, ManufacturerPartNumber) AS score
    ) AS fms
WHERE LEN(ManufacturerPartNumber) < LEN(@objectModel) + 5

I was hoping for the CROSS APPLY to only fuzzy-match against items close to the length of the parameter in @objectModel.

Casey ScriptFu Pharr
  • Instead of a scalar function you will have better luck with an inline table valued function. Also, please note you have top 5 but no order by. – Sean Lange Oct 17 '19 at 20:36
  • I will look up inline table functions. Also, won't the top 5 work with Order by Score? Oops, let me update it. I normally have it in there. – Casey ScriptFu Pharr Oct 17 '19 at 20:49
  • Amen to what Sean said. Note this article: https://www.sqlservercentral.com/forums/topic/how-to-make-scalar-udfs-run-faster-sql-spackle ... a caveat here is that, if your scalar UDF is a well written CLR (a majority of fuzzy matching functions I've seen in SQL Server are) then performance should be fine, perhaps better. Pre-SQL 2019 T-SQL scalar UDFs, on the other hand, are irrefutably horrific. – Alan Burstein Oct 17 '19 at 20:58
  • LOL of course the order by will work....it wasn't posted when I made my comment. You added that later. ;) – Sean Lange Oct 17 '19 at 21:01
  • *"please note you have top 5 but no order by. "* also note @SeanLange that a `ORDER BY` in combination with `TOP` can still give non deterministic results in cases of ties.. Pagination / batch processing in SQL in general always requires a deterministic sort, meaning having atleast using one column in the `ORDER BY` which you know has unique values for sure like a indentity or unique key -> more or less means using `ORDER BY score, unique_id ASC/DESC` – Raymond Nijland Oct 17 '19 at 21:07
  • @RaymondNijland true, but top 5 with no order by is what we saw here. Wasn't even going into the complexities of ties. But worth noting indeed. – Sean Lange Oct 17 '19 at 21:12
  • Yes, I agree @SeanLange; we didn't see the table structures, which (if very big?) could have changed the meaning of this SQL. Anyhow, I tend to warn people about this as it is easily forgotten and/or overlooked. – Raymond Nijland Oct 17 '19 at 21:15
  • I think the CROSS APPLY does not take the WHERE condition into account, because the CROSS APPLY is inside the FROM clause and the WHERE condition is evaluated after that. – Muflix May 24 '21 at 10:24

1 Answer


I was hoping for the CROSS APPLY to only fuzzy-match against items close to the length of the parameter in @objectModel.

To measure proximity use ABS(a-b). If you want strings of similar (close) lengths it would be:

ABS(LEN(string1) - LEN(string2))

If you wanted strings that are, say, no more than 5 characters longer or shorter than the other, the WHERE clause would look like:

WHERE ABS(LEN(string1) - LEN(string2)) <= 5
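Applied to the query you posted, that pre-filter might look like the sketch below (assuming @objectModel holds the user-supplied part number, e.g. FD1234):

SELECT TOP 5 Manufacturer, ManufacturerPartNumber, Description AS ManufacturerDescription,
       CONVERT(money, Price) AS Price, fms.Score
FROM   Products_OurProducts_Products_View
CROSS APPLY (
       SELECT dbo.FuzzyControlMatch(@objectModel, ManufacturerPartNumber) AS Score
) AS fms
WHERE  ABS(LEN(ManufacturerPartNumber) - LEN(@objectModel)) <= 5  -- only fuzzy rows of similar length
ORDER BY fms.Score DESC;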

Can I add a clause like the one below to get a more precise data set to fuzzy-match, or will this happen AFTER the CROSS APPLY has already run and only affect the final result?

SQL Server's logical query processing order (https://images.app.goo.gl/qTQpdg2NsC7M4eUj9) dictates that the WHERE clause is evaluated second (right after FROM). Scalar UDFs in WHERE clauses change this: if dbo.FuzzyControlMatch is a T-SQL scalar UDF and is used in the WHERE clause, then it will be processed first and you will be forced to evaluate everything. That does not appear to be the case with what you posted, though.
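For contrast, this is the pattern to avoid (a sketch only; @MinScore is a hypothetical threshold), because the scalar UDF in the WHERE clause has to run against every row before anything can be filtered:

-- Anti-pattern: the scalar UDF must be evaluated for every row in the view
SELECT Manufacturer, ManufacturerPartNumber
FROM   Products_OurProducts_Products_View
WHERE  dbo.FuzzyControlMatch(@objectModel, ManufacturerPartNumber) >= @MinScore;  -- @MinScore is hypothetical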

How to improve performance here?

What I would do for starters is compute the string length ahead of time. You could use a persisted computed column, then add an index on that length column (plus whatever the TOP 5 is ordered by) and INCLUDE the other columns used by the query.
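Here's a rough sketch of that idea. The table and index names are hypothetical, and it assumes the view sits over a base table you are allowed to alter (the computed Score can't be indexed, so the index only covers the length plus the columns the query returns):

-- Hypothetical base table name; adjust to whatever the view selects from
ALTER TABLE dbo.Products_OurProducts_Products
  ADD PartNumberLen AS LEN(ManufacturerPartNumber) PERSISTED;

CREATE NONCLUSTERED INDEX ix_Products_PartNumberLen
  ON dbo.Products_OurProducts_Products (PartNumberLen)
  INCLUDE (Manufacturer, ManufacturerPartNumber, Description, Price);

-- The pre-filter then becomes a simple range predicate, e.g.:
-- WHERE PartNumberLen BETWEEN LEN(@objectModel) - 5 AND LEN(@objectModel) + 5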

Alternatively you could pre-filter using a temp table or table variable and apply your function there instead.
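For example (again just a sketch built from the query you posted):

DECLARE @objectModel VARCHAR(8000) = 'FD1234';

-- Cheap pre-filter into a temp table
SELECT Manufacturer, ManufacturerPartNumber, Description, Price
INTO   #candidates
FROM   Products_OurProducts_Products_View
WHERE  ABS(LEN(ManufacturerPartNumber) - LEN(@objectModel)) <= 5;

-- Fuzzy only the candidates
SELECT TOP 5 c.Manufacturer, c.ManufacturerPartNumber,
       c.Description AS ManufacturerDescription,
       CONVERT(money, c.Price) AS Price, fms.Score
FROM   #candidates AS c
CROSS APPLY (
       SELECT dbo.FuzzyControlMatch(@objectModel, c.ManufacturerPartNumber) AS Score
) AS fms
ORDER BY fms.Score DESC;

DROP TABLE #candidates;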

Either way, I have a function and a couple algorithms that will change your life if you deal with string similarity often. I'll post those later tonight when I have more time.

CONTINUED....

I'm not familiar with your similarity function as you have not provided any DDL, but I do know how to measure string similarity (https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227) using algorithms such as Levenshtein, Damerau-Levenshtein, the Longest Common Substring, and the Longest Common Subsequence (LCSQ). Levenshtein can be solved in O(mn) time, Damerau-Levenshtein in O(mn*max(m,n)) time, and the Longest Common Substring in O(mn), or O(n)+O(nk) with a generalized suffix tree. The Longest Common Subsequence of two strings can also be solved in O(mn) time; it is the generalized problem, over an arbitrary number of strings, that is NP-hard.

Levenshtein & Damerau-Levenshtein are Distance metrics https://en.wikipedia.org/wiki/String_metric and can measure similary like so: take two strings S1 and S2, S1 always being shorter or equal to S2; L1 as the length of S1, L2 the length of S2 and E as your Edit Distance... The formula for the similarity between (S1,S2) is: (L1-E)/L2. Consider the words, "Theirs" and "Thiers". Both strings are 6 characters long and the Levenshein Distance is 2. (6-2)/6 = .67; according to Levenshtein these guys are 67% similar. The Damerau-Levenshtein Distance between these strings is 1; (6-1)/6 = .83; per DLD these strings have a similarity score of 83%.

With the longest common substring or subsequence (call the length of either LS), the similarity is LS/L2. E.g. the longest common substring between 'ABC123' and 'ABC12.3' is 'ABC12'; LS=5, L2=7, 5/7 ≈ 71% similar. The longest common subsequence between the two is 'ABC123'; LS=6, 6/7 ≈ 86% similarity.
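Worked out in T-SQL, those example scores look like this (the edit distances and LS values above are plugged in by hand):

SELECT LevenshteinSim   = 1.0 * (6-2) / 6,  -- "Theirs"/"Thiers", E = 2 -> 0.67
       DamerauLevSim    = 1.0 * (6-1) / 6,  -- transposition counted once, E = 1 -> 0.83
       LcSubstringSim   = 1.0 * 5 / 7,      -- 'ABC12' vs 'ABC12.3'  -> 0.71
       LcSubsequenceSim = 1.0 * 6 / 7;      -- 'ABC123' vs 'ABC12.3' -> 0.86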

Introducing the Bernie Distance; it's simple: when S1 is a substring of S2, the Bernie Distance (BD) is L2-L1 and the similarity can be measured as L1/L2. For example: BD("xxx","yyy") returns NULL; BD("Kangaroo","Kangarooo") = 1... L2=9, L1=8, L2-L1=1. Here, Bernie gives us a distance of 1 and a similarity score of 8/9 ≈ 0.8889 (about 89%). These two metrics are computed in O-nastyfast time - they're basically free. The Bernie metrics will often be NULL; when that's the case you've lost nothing and gained nothing. When Bernie isn't NULL, however, you have just accomplished something special: *you have solved Levenshtein (LD), Damerau-Levenshtein (DLD), the Longest Common Substring & Subsequence (LCSS) and many, many more.* When Bernie (B) is NOT NULL then LD=B, DLD=B and LCSS=L1. I would not be surprised if you could apply Bernie to your similarity function ;^) This is known as reduction.
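The whole idea fits in a couple of expressions; here's a minimal inline sketch (the full samd.bernie8K at the end of this post adds trimming, transposition handling and the other columns):

DECLARE @s1 VARCHAR(8000) = 'Kangaroo',   -- S1, the shorter string
        @s2 VARCHAR(8000) = 'Kangarooo';  -- S2, the longer string

SELECT B = CASE WHEN CHARINDEX(@s1, @s2) > 0 THEN LEN(@s2) - LEN(@s1) END,  -- Bernie Distance; NULL when S1 is not inside S2
       S = CASE WHEN CHARINDEX(@s1, @s2) > 0
                THEN CAST(1.0 * LEN(@s1) / LEN(@s2) AS DECIMAL(6,4)) END;   -- similarity = L1/L2

-- Returns: B = 1, S = 0.8889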

Included at the end of this post is bernie8K (VARCHAR(8000)). In addition to the Bernie Distance and similarity, you can use Bernie to calculate the Maximum Similarity (MS), where MS = L1/L2. MS("ABCD","ABCXYZ") is 67%; in other words, when L1=4 and L2=6, the two strings cannot be any more than 67% similar (4/6 ≈ 0.6667). Armed with this information you can create a minimum-similarity parameter that allows you to drastically reduce the number of comparisons. Now the demo.

Problem:

I once had a large client with thousands of employees. The DB they inherited had hundreds of duplicate job titles that had been entered by hand, such as "Loan Officer" and "Loan Officr". Reports would show that they had 2005 Loan Officers and 16 Loan Officrs; in reality they had 2021 Loan Officers (16 with misspelled job titles). The task was to identify (and de-dupe) these job titles. This example is a scaled-down version of that problem. Note my comments.

-- Sample data.
DECLARE @jobs TABLE
(
  JobId  INT IDENTITY PRIMARY KEY,
  JobCat VARCHAR(100) NOT NULL
);

INSERT @jobs(JobCat) 
VALUES('Editing Department'),('Director'),('Producer'),('Actor'),
      ('Film Editing Department'),('Producers'),('Directer');

-- Without any pre-filtering I would need to compare all 21 string pairs...
SELECT      j1.JobCat, j2.JobCat
FROM        @jobs AS j1
CROSS JOIN  @jobs AS j2
CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS  b
WHERE       j1.JobId < j2.JobId;

Returns:

JobCat                             JobCat
---------------------------------- ---------------------------------
Editing Department                 Director
Editing Department                 Producer
...
Director                           Directer
Producer                           Actor
...

Now we'll leverage the Bernie Distance to get our answer and exclude unnecessary comparisons. String pairs where B is not NULL have been solved; string pairs where MS < @MinSim have been removed. We just reduced our work from 21 comparisons to 5 and identified 2 duplicates very quickly.

DECLARE @MinSim DEC(6,4) = .8;

SELECT      j1.JobId, j2.JobId, b.S1, b.S2, b.L1, b.L2, b.B, b.MS, b.S
FROM        @jobs AS j1
CROSS JOIN  @jobs AS j2
CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS  b
WHERE       j1.JobId < j2.JobId
AND         (b.MS >= @MinSim OR b.B IS NOT NULL);

Returns:

JobId       JobId       S1                         S2                    L1   L2  B     MS       S
----------- ----------- --------------------- -------------------------- ---- --- ----- -------- -------
1           5           Editing Department    Film Editing Department    18   23  5     0.7826   0.7826
2           3           Director              Producer                   8    8   NULL  1.0000   NULL
2           6           Director              Producers                  8    9   NULL  0.8889   NULL
2           7           Director              Directer                   8    8   NULL  1.0000   NULL
3           6           Producer              Producers                  8    9   1     0.8889   0.8889
3           7           Producer              Directer                   8    8   NULL  1.0000   NULL
6           7           Directer              Producers                  8    9   NULL  0.8889   NULL

This reduction stuff is cool! Let's bring a couple more algorithms to the party. First we'll grab a copy of NGrams8k and create a function to compute the Hamming Distance for similarity. Hamming (HD) can be computed in O(n) time; the similarity is (L1-HD)/L2. Note that when HD=1, then LD=1, DLD=1, the longest common subsequence is L1-1, and we have probably computed your similarity too.

-- Sample data.
DECLARE @jobs TABLE
(
  JobId  INT IDENTITY PRIMARY KEY,
  JobCat VARCHAR(100) NOT NULL
);

INSERT @jobs(JobCat) 
VALUES('Editing Department'),('Director'),('Producer'),('Actor'),
      ('Film Editing Department'),('Producers'),('Directer');

DECLARE @MinSim DECIMAL(6,4) = .8;

WITH br AS
(
  SELECT      b.*
  FROM        @jobs AS j1
  CROSS JOIN  @jobs AS j2
  CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS  b
  WHERE       j1.JobId < j2.JobId
  AND         (b.MS >= @MinSim OR b.B IS NOT NULL)
) 
SELECT      br.S1, br.S2, br.L1, br.L2, br.D, S = h.MinSim
FROM        br
CROSS APPLY samd.HammingDistance8k(br.S1, br.S2) AS h
WHERE       br.B IS NULL
AND         h.MinSim >= @MinSim
UNION ALL
SELECT      br.S1, br.S2, br.L1, br.L2, br.D, br.S
FROM        br
WHERE       br.B IS NOT NULL;

Returns:

S1                     S2                        L1          L2          D           S
---------------------- ------------------------- ----------- ----------- ----------- --------------
Director               Directer                  8           8           0           0.87500000000
Editing Department     Film Editing Department   18          23          5           0.78260000000
Producer               Producers                 8           9           1           0.88890000000

Summary:

We started with 21 string pairs to compare. Using Bernie we reduced that number to 5 (2 solved, 14 excluded). Using Hamming we picked off another. Only four to go!
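To finish the job, only those four surviving pairs need to be handed to an expensive similarity function - your dbo.FuzzyControlMatch, for example. A sketch, reusing the @jobs sample and @MinSim from above:

SELECT      br.S1, br.S2,
            Score = dbo.FuzzyControlMatch(br.S1, br.S2)  -- the expensive call, now against only 4 pairs
FROM
(
  SELECT      b.*
  FROM        @jobs AS j1
  CROSS JOIN  @jobs AS j2
  CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS b
  WHERE       j1.JobId < j2.JobId
  AND         (b.MS >= @MinSim OR b.B IS NOT NULL)
) AS br
CROSS APPLY samd.HammingDistance8k(br.S1, br.S2) AS h
WHERE       br.B IS NULL                    -- not already solved by Bernie
AND         ISNULL(h.MinSim, 0) < @MinSim;  -- not picked off by Hamming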

The Functions:

CREATE FUNCTION samd.bernie8K
(
  @s1 VARCHAR(8000), 
  @s2 VARCHAR(8000)
)
/*****************************************************************************************
[Purpose]:
 This function allows developers to optimize and simplify how they perform fuzzy
 comparisons between two strings (@s1 and @s2).

 bernie8K returns:
  S1  = short string - LEN(S1) will always be <= LEN(S2); The formula to calculate S1 is:
          S1 = CASE WHEN LEN(@s1) > LEN(@s2) THEN @s2 ELSE @s1 END;
  S2  = long string  - LEN(S2) will always be >= LEN(S1); The formula to calculate S2 is:
          S2 = CASE WHEN LEN(@s1) > LEN(@s2) THEN @s1 ELSE @s2 END;
  L1  = short string length = LEN(S1)
  L2  = long string length  = LEN(S2)
  D   = distance            = L2-L1; how many characters are needed to make L1=L2; D tells us:
          1. D     is the *minimum* Levenshtein distance between S1 and S2
          2. L1/L2 is the *maximum* similarity between S1 and S2
  I   = index               = CHARINDEX(S1,S2);
  B   = bernie distance     = When B is not NULL then:
          1. B = The Levenshtein Distance between S1 and S2
          2. B = The Damerau-Levenshtein Distance between S1 and S2
          3. S1 is both the Longest Common Substring and the Longest Common Subsequence of S1 and S2 (length = L1)
          4. KEY! The similarity between S1 and S2 is L1/L2
  MS  = Max Similarity      = Maximum similarity
  S   = Minimum Similarity  = When B isn't null S is the same Similarity value returned by
        mdq.Similarity: https://msdn.microsoft.com/en-us/library/ee633878(v=sql.105).aspx

[Author]:
  Alan Burstein

[Compatibility]: 
 SQL Server 2005+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse

[Parameters]:
 @s1 = varchar(8000); First of two input strings to be compared
 @s2 = varchar(8000); Second of two input strings to be compared

[Returns]:
 S1 = VARCHAR(8000); The shorter of @s1 and @s2; returns @s1 when LEN(@s1)=LEN(@s2)
 S2 = VARCHAR(8000); The longer  of @s1 and @s2; returns @s2 when LEN(@s1)=LEN(@s2)
 L1 = INT; The length of the shorter of @s1 and @s2 (or both when they're of equal length)
 L2 = INT; The length of the longer  of @s1 and @s2 (or both when they're of equal length)
 D  = INT; L2-L1; The "distance" between L1 and L2
 I  = INT; The location (position) of S1 inside S2; Note that when I>0 then S1 is both:
       1.  a substring   of S2
       2.  a subsequence of S2
 B  = INT; The Bernie Distance between @s1 and @s2; When B is not null then:
       1. B = The Levenshtein Distance between S1 and S2
       2. B = The Damerau-Levenshtein Distance between S1 and S2
       3. S1 is both the Longest Common Substring and the Longest Common Subsequence of S1 and S2 (length = L1)
       4. KEY! The similarity between S1 and S2 is L1/L2
 MS = DECIMAL(6,4); Returns the same similarity score as mdq.Similarity would if S1 were a
      substring of S2
 S  = DECIMAL(6,4); When B isn't null then S is the same Similarity value returned by
      mdq.Similarity

 For more about mdq.Similarity visit:
    https://msdn.microsoft.com/en-us/library/ee633878(v=sql.105).aspx

[Syntax]:
--===== Autonomous
 SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
 FROM   samd.bernie8K('abc123','abc12') AS b;

--===== CROSS APPLY example
 SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
 FROM        dbo.SomeTable            AS t
 CROSS APPLY samd.bernie8K(t.S1,t.S2) AS b;

[Dependencies]:
 N/A

[Developer Notes]:
 X. Bernie ignores leading and trailing spaces, and returns trimmed strings!
 1. When @s1 is NULL then S2 = @s2, L2 = LEN(@s2); 
    When @s2 is NULL then S1 = @s1, L1 = LEN(@s1)
 2. bernie8K ignores leading and trailing whitespace on both input strings (@s1 and @s2). 
    In other words LEN(@s1)=DATALENGTH(@s1), LEN(@s2)=DATALENGTH(@s2)
 3. bernie8K is deterministic; for more about deterministic and nondeterministic
    functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx

[Examples]:
--==== 1. BASIC USE:
  -- 1.1. When b.I > 0  
  SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
  FROM samd.bernie8K('abc1234','bc123') AS b;

  -- 1.2. When b.I = 0
  SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
  FROM samd.bernie8K('abc123','xxx') AS b;
-----------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20180708 - Initial Creation - Alan Burstein
 Rev 01 - 20181231 - Added Boolean logic for transpositions (TX column) - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT 
  TX = base.TX,     -- transposed? boolean - were S1 and S2 transposed?
  S1 = base.S1,     -- short string >> IIF(LEN(@s1)>LEN(@s2),@s2,@s1)
  S2 = base.S2,     -- long  string >> IIF(LEN(@s1)>LEN(@s2),@s1,@s2)
  L1 = base.L1,     -- short string length >> IIF(LEN(@s1)>LEN(@s2),LEN(@s2),LEN(@s1))
  L2 = base.L2,     -- long  string length >> IIF(LEN(@s1)>LEN(@s2),LEN(@s1),LEN(@s2))
  D  = base.D,        -- bernie string distance >> # of characters needed to make L1=L2
  I  = iMatch.idx,  -- bernie index >> position of S1 within S2
  B  = bernie.D,      -- bernie distance >> IIF(CHARINDEX(S1,S2)>0,L2-L1,NULL)
  MS = maxSim.D,    -- maximum similarity
  S  = similarity.D -- (minimum) similarity
FROM
(
  SELECT
    TX = CASE WHEN ls.L=1 THEN 1 ELSE 0 END,
    S1 = CASE WHEN ls.L=1 THEN s.S2 ELSE s.S1 END,
    S2 = CASE WHEN ls.L=1 THEN s.S1 ELSE s.S2 END,
    L1 = CASE WHEN ls.L=1 THEN l.S2 ELSE l.S1 END,
    L2 = CASE WHEN ls.L=1 THEN l.S1 ELSE l.S2 END,
    D  = ABS(l.S1-l.S2)

  FROM        (VALUES(LEN(LTRIM(@s1)),LEN(LTRIM(@s2))))     AS l(S1,S2) -- LEN(S1,S2)
  CROSS APPLY (VALUES(RTRIM(LTRIM(@S1)),RTRIM(LTRIM(@S2)))) AS s(S1,S2) -- S1 and S2 trimmed
    CROSS APPLY (VALUES(SIGN(l.S1-l.S2)))                     AS ls(L)    -- LeftLength
) AS base
CROSS APPLY (VALUES(ABS(SIGN(base.L1)-1),ABS(SIGN(base.L2)-1)))             AS blank(S1,S2)
CROSS APPLY (VALUES(CHARINDEX(base.S1,base.S2)))                            AS iMatch(idx)
CROSS APPLY (VALUES(CASE WHEN SIGN(iMatch.idx|blank.S1)=1 THEN base.D END)) AS bernie(D)
CROSS APPLY (VALUES(CAST(CASE blank.S1 WHEN 1 THEN 1.*blank.S2 
                      ELSE 1.*base.L1/base.L2 END AS DECIMAL(6,4))))        AS maxSim(D)
CROSS APPLY (VALUES(CAST(1.*NULLIF(SIGN(iMatch.idx),0)*maxSim.D 
                      AS DECIMAL(6,4))))                                    AS similarity(D);
GO

CREATE FUNCTION dbo.rangeAB
(
  @low  BIGINT, -- (start) Lowest  number in the set
  @high BIGINT, -- (stop)  Highest number in the set
  @gap  BIGINT, -- (step)  Difference between each number in the set
  @row1 BIT     -- Base: 0 or 1; should RN begin with 0 or 1?
)
/****************************************************************************************
[Purpose]:
 Creates a lazy, in-memory...

[Author]: Alan Burstein

[Compatibility]: 
 SQL Server 2008+ and Azure SQL Database 

[Syntax]:
 SELECT r.RN, r.OP, r.N1, r.N2
 FROM   dbo.rangeAB(@low,@high,@gap,@row1) AS r;

[Parameters]:
 @low  = BIGINT; represents the lowest  value for N1.
 @high = BIGINT; represents the highest value for N1.
 @gap  = BIGINT; represents how much N1 and N2 will increase each row. @gap also
         represents the difference between N1 and N2.
 @row1 = BIT; represents the base (first) value of RN. When @row1 = 0, RN begins with 0,
         when @row1 = 1 then RN begins with 1.

[Returns]:
 Inline Table Valued Function returns:
 RN = BIGINT; a row number that works just like T-SQL ROW_NUMBER() except that it can 
      start at 0 or 1 which is dictated by @row1. If you are returning the numbers:
      (0 or 1) Through @high, then use RN as your "N" value, otherwise use N1.
 OP = BIGINT; returns the "opposite number" that relates to RN. When RN begins with 0 the
      first number in the set will be 0 for RN and the last number in the set will be 0 for
      OP. When RN is 1 to 10, the numbers 1 to 10 are returned in ascending order for RN
      and in descending order for OP.

      Given the Numbers 1 to 3, 3 is the opposite of 1, 2 the opposite of 2, and 1 is the
      opposite of 3. Given the numbers -1 to 2, the opposite of -1 is 2, the opposite of 0
      is 1, and the opposite of 1 is 0.  
 N1 = BIGINT; This is the "N" in your tally table/numbers function. This is your *lazy*
      sequence of numbers starting at @low and incrementing by @gap until the next number
      in the sequence is greater than @high.
 N2 = BIGINT; a lazy sequence of numbers starting at @low+@gap and incrementing by @gap. N2
      will always be greater than N1 by @gap. N2 can also be thought of as:
      LEAD(N1,1,N1+@gap) OVER (ORDER BY RN)

[Dependencies]:
N/A

[Developer Notes]:
 1. The lowest and highest possible numbers returned are whatever is allowable by a 
    bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3). 
 2. @gap does not affect RN, RN will begin at @row1 and increase by 1 until the last row
    unless it's used in a subquery where a filter is applied to RN.
 3. @gap must be greater than 0 or the function will not return any rows.
 4. Keep in mind that when @row1 is 0 then the highest RN value (ROWNUMBER) will be the 
    number of rows returned minus 1
 5. If all you need is a sequential set beginning at 0 or 1 then, for best performance,
    use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
    number other than 0 or 1, or if you need a gap between your sequence of numbers.
 6. Although @gap is a bigint it must be a positive integer or the function will
    not return any rows.
 7. The function will not return any rows when one of the following conditions is true:
      * any of the input parameters are NULL
      * @high is less than @low 
      * @gap is not greater than 0
    To force the function to return all NULLs instead of not returning anything you can
    add the following code to the end of the query:

      UNION ALL 
      SELECT NULL, NULL, NULL, NULL
      WHERE NOT (@high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0)

    This code was excluded as it adds a ~5% performance penalty.
 8. There is no performance penalty for sorting by RN ASC; there is, however, a large
    performance penalty for sorting in descending order. If you need a descending sort,
    use OP in place of RN and still sort by RN ASC.
 9. For 2012+ systems, The TOP logic can be replaced with:
   OFFSET 0 ROWS FETCH NEXT 
     ABS((ISNULL(@high,0)-ISNULL(@low,0))/ISNULL(@gap,0)+ISNULL(@row1,1)) ROWS ONLY

Best Practices:
--===== 1. Using RN (rownumber)
 -- (1.1) The best way to get the numbers 1,2,3...@high (e.g. 1 to 5):
 SELECT r.RN
 FROM   dbo.rangeAB(1,5,1,1) AS r;

 -- (1.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 0 to 5):
 SELECT r.RN
 FROM   dbo.rangeAB(0,5,1,0) AS r;

--===== 2. Using OP for descending sorts without a performance penalty
 -- (2.1) The best way to get the numbers @high down to 1 in descending order (e.g. 5 to 1):
 SELECT   r.OP
 FROM     dbo.rangeAB(1,5,1,1) AS r 
 ORDER BY R.RN ASC;

 -- (2.2) The best way to get the numbers @high-1 down to 0 in descending order (e.g. 5 to 0):
 SELECT   r.OP 
 FROM     dbo.rangeAB(1,6,1,0) AS r
 ORDER BY r.RN ASC;

--===== 3. Using N1
 -- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
 SELECT r.N1
 FROM   dbo.rangeAB(-3,3,1,1) AS r;

 -- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
 SELECT r.RN, r.N1
 FROM   dbo.rangeAB(-3,3,1,1) AS r;

 -- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
 SELECT r.RN, r.N1
 FROM dbo.rangeAB(-3,3,1,0) AS r;

--===== 4. Using N2 and @gap
 -- (4.1) To get 0,10,20,30...100, set @low to 0, @high to 100 and @gap to 10:
 SELECT r.N1
 FROM   dbo.rangeAB(0,100,10,1) AS r;
 -- (4.2) Note that N2=N1+@gap; this allows you to create a sequence of ranges.
 --       For example, to get (0,10),(10,20),(20,30).... (90,100):

 SELECT r.N1, r.N2
 FROM  dbo.rangeAB(0,90,10,1) AS r;

-----------------------------------------------------------------------------------------
[Revision History]:
 Rev 00 - 20140518 - Initial Development - AJB
 Rev 01 - 20151029 - Added 65 rows. Now L1=465; 465^3=100.5M. Updated comments - AJB
 Rev 02 - 20180613 - Complete re-design including opposite number column (op)
 Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - AJB
 Rev 04 - 20190306 - Added inline aliasing function(f): 
                     f.R=(@high-@low)/@gap, f.N=@gap+@low - AJB
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N)  AS 
(
  SELECT 1
  FROM (VALUES
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0)) T(N) -- 90 values
),
L2(N)  AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT RN = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
  SELECT
    RN = 0,
    OP = (@high-@low)/@gap,
    N1 = @low,
    N2 = @gap+@low
  WHERE @row1 = 0
  UNION ALL
  SELECT TOP (ABS((ISNULL(@high,0)-ISNULL(@low,0))/ISNULL(@gap,0)+ISNULL(@row1,1)))
    RN = i.RN,
    OP = (@high-@low)/@gap+(2*@row1)-i.RN,
    N1 = (i.rn-@row1)*@gap+@low,
    N2 = (i.rn-(@row1-1))*@gap+@low
  FROM       iTally AS i
  ORDER BY   i.RN
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low 
AND   @gap > 0;
GO

CREATE FUNCTION samd.NGrams8k
(
  @string VARCHAR(8000), -- Input string
  @N      INT            -- requested token size
)
/*****************************************************************************************
[Purpose]:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
 based on an input string (@string). Accepts strings up to 8000 varchar characters long.

[Author]: 
 Alan Burstein

[Compatibility]:
 SQL Server 2008+, Azure SQL Database

[Syntax]:
--===== Autonomous
 SELECT ng.position, ng.token 
 FROM   samd.NGrams8k(@string,@N) AS ng;

--===== Against a table using APPLY
 SELECT      s.SomeID, ng.position, ng.token
 FROM        dbo.SomeTable AS s
 CROSS APPLY samd.NGrams8K(s.SomeValue,@N) AS ng;

[Parameters]:
 @string  = The input string to split into tokens.
 @N       = The size of each token returned.

[Returns]:
 Position = BIGINT; the position of the token in the input string
 token    = VARCHAR(8000); a @N-sized character-level N-Gram token

[Dependencies]:
 1. dbo.rangeAB (iTVF)

[Revision History]:
------------------------------------------------------------------------------------------
 Rev 00 - 20140310 - Initial Development - Alan Burstein
 Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
                     conversion to bigint in the TOP logic to remove implicit conversion
                     to bigint - Alan Burstein
 Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less
                     than the length of @string. Updated comment section. - Alan Burstein
 Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N
                     parameters to prevent a NULL string or NULL @N from causing "an
                     improper value" being passed to the TOP clause. - Alan Burstein
 Rev 05 - 20171228 - Small simplification; changed: 
                (ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@N,1)-1)),0)))
                                           to:
                (ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))+1-ISNULL(@N,1)),0)))
 Rev 06 - 20180612 - Now using CHECKSUM(N) to convert N in the token output instead of
                     using CAST(N AS INT); CHECKSUM removes the need to convert to int.
 Rev 07 - 20180612 - re-designed to: Use dbo.rangeAB - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
  position   = r.RN,
  token      = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM  dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO

CREATE FUNCTION samd.hammingDistance8K 
(
  @s1 VARCHAR(8000), -- first input string
  @s2 VARCHAR(8000)  -- second input string
)
/*****************************************************************************************
[Purpose]:
 Purely set-based iTVF that returns the Hamming Distance between two strings of equal 
 length. See: https://en.wikipedia.org/wiki/Hamming_distance

[Author]:
 Alan Burstein

[Compatibility]:
 SQL Server 2008+

[Syntax]:
--===== Autonomous
 SELECT h.HD
 FROM   samd.hammingDistance8K(@s1,@s2) AS h;

--===== Against a table using APPLY
 SELECT t.string, S2 = @s2, h.HD
 FROM   dbo.someTable AS t
 CROSS 
 APPLY  samd.hammingDistance8K(t.string, @s2) AS h;

[Parameters]:
  @s1 = VARCHAR(8000); the first input string
  @s2 = VARCHAR(8000); the second input string

[Dependencies]:
 1. samd.NGrams8K

[Examples]:
--===== 1. Basic Use
DECLARE @s1 VARCHAR(8000) = 'abc1234',
        @s2 VARCHAR(8000) = 'abc2234';

SELECT h.HD, h.L, h.minSim
FROM   samd.hammingDistance8K(@s1,@s2) AS h;

---------------------------------------------------------------------------------------
[Revision History]: 
 Rev 00 - 20180800 - Initial re-design - Alan Burstein
 Rev 01 - 20181116 - Added L (Length) and minSim
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT H.HD, H.L, minSim = 1.*(H.L-H.HD)/H.L
FROM
( 
  SELECT LEN(@s1)-SUM(CHARINDEX(ng.token,SUBSTRING(@S2,ng.position,1))), 
         CASE LEN(@s1) WHEN LEN(@s2) THEN LEN(@s1) END
  FROM   samd.NGrams8k(@s1,1) AS ng
  WHERE  LEN(@S1)=LEN(@S2)
) AS H(HD,L);
GO
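Tying this back to your original query: once the functions above are in place, you could use bernie8K as the cheap pre-filter and save dbo.FuzzyControlMatch for the rows that survive. A sketch (the 0.5 minimum-similarity threshold is just a placeholder to tune):

DECLARE @objectModel VARCHAR(8000) = 'FD1234',
        @MinSim      DECIMAL(6,4)  = 0.5;   -- hypothetical threshold

WITH pre AS
(
  SELECT      p.Manufacturer, p.ManufacturerPartNumber, p.Description, p.Price
  FROM        Products_OurProducts_Products_View AS p
  CROSS APPLY samd.bernie8K(@objectModel, p.ManufacturerPartNumber) AS b
  WHERE       b.MS >= @MinSim OR b.B IS NOT NULL   -- cheap length/substring pre-filter
)
SELECT TOP 5 pre.Manufacturer, pre.ManufacturerPartNumber,
       pre.Description AS ManufacturerDescription,
       CONVERT(money, pre.Price) AS Price, fms.Score
FROM        pre
CROSS APPLY (
       SELECT dbo.FuzzyControlMatch(@objectModel, pre.ManufacturerPartNumber) AS Score
) AS fms
ORDER BY fms.Score DESC;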
Alan Burstein