T-SQL - Count unique characters in a variable

Question

Goal: To count # of distinct characters in a variable the fastest way possible.

DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4

I've found some posts in regards to distinct characters in a column, grouped by characters, and etc, but not one for this scenario.

See this question, it is for mysql, but I believe it will work: https://stackoverflow.com/questions/29967280/count-number-of-unique-characters-in-a-string — Kevin, Nov 07 '19 at 15:48

Thom A · Accepted Answer · 2019-11-07T15:58:00.357

Using NGrams8K as a base, you can change the input parameter to a nvarchar(4000) and tweak the DATALENGTH, making NGramsN4K. Then you can use that to split the string into individual characters and count them:

SELECT COUNT(DISTINCT NG.token) AS DistinctCharacters
FROM dbo.NGramsN4k(@String1,1) NG;

Altered NGrams8K:

IF OBJECT_ID('dbo.NGramsN4k','IF') IS NOT NULL DROP FUNCTION dbo.NGramsN4k;
GO
CREATE FUNCTION dbo.NGramsN4k
(
  @string nvarchar(4000), -- Input string 
  @N      int            -- requested token size
)
/****************************************************************************************
Purpose:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens 
 based on an input string (@string). Accepts strings up to 8000 varchar characters long.
 For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram. 

Compatibility: 
 SQL Server 2008+, Azure SQL Database

Syntax:
--===== Autonomous
 SELECT position, token FROM dbo.NGrams8k(@string,@N);

--===== Against a table using APPLY
 SELECT s.SomeID, ng.position, ng.token
 FROM dbo.SomeTable s
 CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) ng;

Parameters:
 @string  = The input string to split into tokens.
 @N       = The size of each token returned.

Returns:
 Position = bigint; the position of the token in the input string
 token    = varchar(8000); a @N-sized character-level N-Gram token

Developer Notes:  
 1. NGrams8k is not case sensitive

 2. Many functions that use NGrams8k will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the 
    optimizer does not chose one) is to use make_parallel by Adam Machanic which can be 
    found here:
 sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

 3. When @N is less than 1 or greater than the datalength of the input string then no 
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you 
    can't turn anything into NULL-grams, no rows should be returned.

    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:

    UNION ALL 
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)

 4. NGrams8k can also be used as a tally table with the position column being your "N" 
    row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to split
    it into unigrams then only return the position column. NGrams8k will get you up to 
    8000 numbers. There will be no performance penalty for sorting by position in 
    ascending order but there is for sorting in descending order. To get the numbers in
    descending order without forcing a sort in the query plan use the following formula:
    N = <highest number>-position+1. 

 Pseudo Tally Table Examples:
    --===== (1) Get the numbers 1 to 100 in ascending order:
    SELECT N = position 
    FROM dbo.NGrams8k(REPLICATE(0,100),1);

    --===== (2) Get the numbers 1 to 100 in descending order:
    DECLARE @maxN int = 100;
    SELECT N = @maxN-position+1
    FROM dbo.NGrams8k(REPLICATE(0,@maxN),1)
    ORDER BY position;

 5. NGrams8k is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
 SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (@N=1)
 SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams  (@N=2)
 SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (@N=3)

--===== How many times the substring "AB" appears in each record
 DECLARE @table TABLE(stringID int identity primary key, string varchar(100));
 INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');

 SELECT string, occurances = COUNT(*) 
 FROM @table t
 CROSS APPLY dbo.NGrams8k(t.string,2) ng
 WHERE ng.token = 'AB'
 GROUP BY string;

----------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20140310 - Initial Development - Alan Burstein
 Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
                     conversion to bigint in the TOP logic to remove implicit conversion
                     to bigint - Alan Burstein
 Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less 
                     than the length of @string. Updated comment section. - Alan Burstein
 Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N 
                     parameters to prevent a NULL string or NULL @N from causing "an 
                     improper value" being passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH 
L1(N) AS 
(
  SELECT 1
  FROM (VALUES    -- 90 NULL values used to create the CTE Tally table
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),
        (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)
       ) t(N)
),
iTally(N) AS                                   -- my cte Tally table
(
  SELECT TOP(ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,N''))/2)-(ISNULL(@N,1)-1)),0)))
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort
  FROM L1 a CROSS JOIN L1 b                    -- cartesian product for 8100 rows (90^2)
)
SELECT
  position = N,                                -- position of the token in the string(s)
  token    = SUBSTRING(@string,CAST(N AS int),@N)  -- the @N-Sized token
FROM iTally
WHERE @N > 0 AND @N <= (DATALENGTH(@string)/2);    -- Protection against bad parameter values

Side note: you don't need 90 nulls in your `L1(N)`, only 64 (since 64^2 = 4096)... — Zohar Peled, Nov 07 '19 at 16:07
True, @ZoharPeled; that was the "lazy" part of me changing the function. I didn't want to do *all* the work for the OP ;) — Thom A, Nov 07 '19 at 16:10
@ZoharPeled See my version below and here: https://stackoverflow.com/questions/47082399/remove-all-comments-from-a-text-in-t-sql?rq=1 — Alan Burstein, Nov 08 '19 at 15:38
@AlanBurstein you should submit that as an additional edit to your existing SSC article. Having both a `varchar` and `nvarchar` version could be really important for people; especially as they may be caught out by `DATALENGTH`. — Thom A, Nov 08 '19 at 15:42

Sean Lange · Answer 2 · 2019-11-08T14:24:03.210

3

Here is another alternative using the power of the tally table. It has been called the "Swiss Army Knife of T-SQL". I keep a tally table as a view on my system which makes it insanely fast.

create View [dbo].[cteTally] as

WITH
    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),
    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS 
    (
        SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    )
select N from cteTally

Now we can use that tally anytime we need it, like for this exercise.

declare @Something table
(
    String1 nvarchar(4000)
)

insert @Something values
(N'1A^')
, (N'11')
, (N'*')
, (N'*A-zz')

select count(distinct substring(s.String1, t.N, 1))
    , s.String1
from @Something s
join cteTally t on t.N <= len(s.String1)
group by s.String1

To be honest I don't know this would be any faster than Larnu's usage of NGrams but testing on a large table would be fun to see.

----- EDIT -----

Thanks to Shnugo for the idea. Using a cross apply to a correlated subquery here is actually quite an improvement.

select count(distinct substring(s.String1, A.N, 1))
    , s.String1
from @Something s
CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)
group by s.String1

The reason this is so much faster is that this is no longer using a triangular join which can really be painfully slow. I did also switch out the view with an indexed physical tally table. The improvement there was noticeable on larger datasets but not nearly as big as using the cross apply.

If you want to read more about triangular joins and why we should avoid them Jeff Moden has a great article on the topic. https://www.sqlservercentral.com/articles/hidden-rbar-triangular-joins

edited Nov 08 '19 at 14:24

answered Nov 07 '19 at 16:16

Sean Lange

33,028
3
25
40

Thank you for a suggestion Sean. I like this non-f(x) Tally table approach too. Cheers! – 007 Nov 07 '19 at 18:34
@007, if you want neither a function nor a view, you can even use this with the cte ahead of your query (*tally-on-the-fly*). – Shnugo Nov 08 '19 at 08:27
Sean, great approach, +1 from my side. Even better might be an indexed physical numbers table, but this was something to check... And one more suggestion might be to use `CROSS APPLY (SELECT TOP(LEN(s.String1)) t.N FROM cteTally t) A(N)`. I can imagine, that a computed `TOP()`-clause works faster in this case... Again something to check :-) – Shnugo Nov 08 '19 at 08:30
@Shnugo switching out the join for a cross apply makes a way bigger improvement than I would have suspected. The physical indexed table makes a simpler execution plan and a little bit of improvement on the original. But the real winner is using cross apply. I am normally in tune with avoiding triangular joins like this and this is a good example of why they should be avoided. Thanks for the nudge. – Sean Lange Nov 08 '19 at 14:18
@SeanLange Great findings! To add one more idea: I'd change the VIEW to a iTVF, taking a `@lowerBoundary` and an `@upperBoundary` as parameters. This would shift the computed `TOP` right to the place where the running list is created. I don't know if the engine is smart enough to reduce the tally's set early enough... – Shnugo Nov 08 '19 at 15:13
In [this answer](https://stackoverflow.com/a/39387790/5089204) I used `SELECT TOP(@end-@start +1) (ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) + @start -1) * @step` to compute the `TOP` and to add a step size (if you'd need just even numbers or something like this... – Shnugo Nov 08 '19 at 15:16

Alan Burstein · Answer 3 · 2019-11-08T15:36:58.790

Grab a copy of NGrams8k and you can do this:

DECLARE @String1 NVARCHAR(4000) = N'1A^' ; --> output = 3
DECLARE @String2 NVARCHAR(4000) = N'11' ; --> output = 1
DECLARE @String3 NVARCHAR(4000) = N'*' ; --> output = 1
DECLARE @String4 NVARCHAR(4000) = N'*A-zz' ; --> output = 4

SELECT     s.String, Total = COUNT(DISTINCT ng.token)
FROM       (VALUES(@String1),(@String2),(@String3),(@String4)) AS s(String)
CROSS APPLY dbo.NGrams8k(s.String,1)                           AS ng
GROUP BY    s.String;

Returns:

String   Total
-------- -----------
*        1
*A-zz    4
11       1
1A^      3

UPDATED Just a quick update based on @Larnu's post and comments. I did not notice that the OP was dealing with Unicode e.g. NVARCHAR. I created an NVARCHAR(4000) version here - similar to what @Larnu posted above. I just updated the return token to use Latin1_General_BIN collation.

SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N)

This returns the correct answer:

DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2

SELECT   COUNT(DISTINCT ng.token)
FROM     dbo.NGramsN4k(@String5,1) AS ng;

Without the collation in place you can use the what Larnu posted and get the right answer like this:

DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ'; --> output = 2

SELECT    COUNT(DISTINCT UNICODE(ng.token))
FROM     dbo.NGramsN4k(@String5,1) AS ng;

Here's my updated NGramsN4K function:

ALTER FUNCTION dbo.NGramsN4K
(
  @string nvarchar(4000), -- Input string
  @N      int             -- requested token size
)
/****************************************************************************************
Purpose:
 A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens 
 based on an input string (@string). Accepts strings up to 4000 nvarchar characters long.
 For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram. 

Compatibility: 
 SQL Server 2008+, Azure SQL Database

Syntax:
--===== Autonomous
 SELECT position, token FROM dbo.NGramsN4K(@string,@N);

--===== Against a table using APPLY
 SELECT s.SomeID, ng.position, ng.token
 FROM dbo.SomeTable s
 CROSS APPLY dbo.NGramsN4K(s.SomeValue,@N) ng;

Parameters:
 @string  = The input string to split into tokens.
 @N       = The size of each token returned.

Returns:
 Position = bigint; the position of the token in the input string
 token    = nvarchar(4000); a @N-sized character-level N-Gram token

Developer Notes:  
 1. NGramsN4K is not case sensitive

 2. Many functions that use NGramsN4K will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the 
    optimizer does not chose one) is to use make_parallel by Adam Machanic which can be 
    found here:
 sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

 3. When @N is less than 1 or greater than the datalength of the input string then no 
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you 
    can't turn anything into NULL-grams, no rows should be returned.

    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:

    UNION ALL 
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL);

 4. NGramsN4K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
 SELECT position, token FROM dbo.NGramsN4K('abcd',1); -- unigrams (@N=1)
 SELECT position, token FROM dbo.NGramsN4K('abcd',2); -- bigrams  (@N=2)
 SELECT position, token FROM dbo.NGramsN4K('abcd',3); -- trigrams (@N=3)

--===== How many times the substring "AB" appears in each record
 DECLARE @table TABLE(stringID int identity primary key, string nvarchar(100));
 INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');

 SELECT string, occurances = COUNT(*) 
 FROM @table t
 CROSS APPLY dbo.NGramsN4K(t.string,2) ng
 WHERE ng.token = 'AB'
 GROUP BY string;

------------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20170324 - Initial Development - Alan Burstein
 Rev 01 - 20191108 - Added Latin1_General_BIN collation to token output - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
  SELECT 1 FROM (VALUES -- 64 dummy values to CROSS join for 4096 rows
        ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
        ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
        ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),
        ($),($),($),($),($),($),($),($),($),($),($),($),($),($),($),($)) t(N)
),
iTally(N) AS 
(
  SELECT 
  TOP (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)))
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))    -- Order by a constant to avoid a sort
  FROM L1 a CROSS JOIN L1 b                       -- cartesian product for 4096 rows (16^2)
)
SELECT
  position = N,                                   -- position of the token in the string(s)
  token    = SUBSTRING(@string COLLATE Latin1_General_BIN,CAST(N AS int),@N) -- the @N-Sized token
FROM iTally
WHERE @N > 0  -- Protection against bad parameter values:
AND   @N <= (ABS(CONVERT(BIGINT,((DATALENGTH(ISNULL(@string,''))/2)-(ISNULL(@N,1)-1)),0)));

Careful with your own function here, Alan; the OP is using `nvarchar` and `NGrams8K` makes use of `DATALENGTH` and expects an `varchar(8000)`. :) A value like `DECLARE @String5 NVARCHAR(4000) = N'ᡣᓡ';` would incorrectly return the value `1`. — Thom A, Nov 07 '19 at 15:51
*(Though on my collation, so does "`NGramsN4K`"; but `NGrams8k` would return 1 due to the returned characters being `'?'` rather than the collation treating them as the same)* — Thom A, Nov 07 '19 at 16:03
This is an awesome f(x) Alan!! Thank you for sharing. Cheers! — 007, Nov 07 '19 at 18:32

score 1 · Answer 4 · answered Nov 07 '19 at 15:52

You can do this natively in SQL Server using CTE and some string manipuation:

DECLARE @TestString NVARCHAR(4000);
SET @TestString = N'*A-zz';

WITH letters AS
(
    SELECT  1 AS Pos,
            @TestString AS Stri,
            MAX(LEN(@TestString)) AS MaxPos,
            SUBSTRING(@TestString, 1, 1) AS [Char]
    UNION ALL
    SELECT  Pos + 1,
            @TestString,
            MaxPos,
            SUBSTRING(@TestString, Pos + 1, 1) AS [Char]
      FROM  letters
      WHERE Pos + 1 <= MaxPos
)
SELECT  COUNT(*) AS LetterCount
  FROM  (
        SELECT  UPPER([Char]) AS [Char]
          FROM  letters
          GROUP BY [Char]
        ) a

Example outputs:

SET @TestString = N'*A-zz';
{execute code}
LetterCount = 4

SET @TestString = N'1A^';
{execute code}
LetterCount = 3

SET @TestString = N'1';
{execute code}
LetterCount = 1

SET @TestString = N'*';
{execute code}
LetterCount = 1

Warning, as the input is an `nvarchar(4000)` this will fail, as it stands, with a string with more than 100 characters. — Thom A, Nov 07 '19 at 15:59

score 1 · Answer 5 · answered Nov 07 '19 at 15:56

CREATE TABLE #STRINGS(
    STRING1 NVARCHAR(4000)
)

INSERT INTO #STRINGS (
    STRING1
)
VALUES
(N'1A^'),(N'11'),(N'*'),(N'*A-zz')


;WITH CTE_T AS (
    SELECT DISTINCT
        S.STRING1
        ,SUBSTRING(S.STRING1, V.number + 1, 1) AS Val
    FROM
        #STRINGS S
    INNER JOIN
        [master]..spt_values V
        ON V.number < LEN(S.STRING1)
    WHERE
        V.[type] = 'P'
)
SELECT
    T.STRING1
    ,COUNT(1) AS CNT
FROM
    CTE_T T
GROUP BY
    T.STRING1

Thank you for a suggestion Bob. Cheers! – 007 Nov 07 '19 at 18:33 — 007, Nov 07 '19 at 18:33

T-SQL - Count unique characters in a variable

5 Answers5

----- EDIT -----