It just occurred to me that I wrote a function a few days ago that is perfect for this type of thing. The function is below, note my code comments for more details about the function.
Solution
select itemNumber = tokenlen/2, leftToken
from dbo.edgeNgrams8k('/A/B/C/D/E')
where tokenlen % 2 = 0;
Results
itemNumber leftToken
-------------------- -----------
1 /A
2 /A/B
3 /A/B/C
4 /A/B/C/D
5 /A/B/C/D/E
The function
if object_id('dbo.edgeNgrams8k', 'IF') is not null drop function dbo.edgeNgrams8k;
go
create function dbo.edgeNgrams8k(@string varchar(8000))
/*****************************************************************************************
Purpose
edgeNgrams8k is an inline table valued function (itvf) that accepts a varchar(8000)
input string (@string) and returns a series of character-level left and right edge
n-grams. An edge n-gram (referred to herin as an "edge-gram" for brevity) is a type of
n-gram (see https://en.wikipedia.org/wiki/N-gram). Instead of a contiguous series of
n-sized tokens (n-grams), however, an edge n-gram is a series of tokens that that begin
with the input string's first character then increases by one character, the next in the
string, unitl the token is as long as the input string.
Left edge-grams start at the beginning of the string and grow from left-to-right. Right
edge-grams begin at the end of the string and grow from right-to-left. Note this query
and the result-set:
select * from dbo.edgeNgrams8k('ABC');
tokenlen leftToken rightTokenIndex righttoken
---------- ------------ ---------------- ----------
1 A 3 C
2 AB 2 BC
3 ABC 1 ABC
Developer Notes:
1. For more about N-Grams in SQL Server see: http://www.sqlservercentral.com/articles/Tally+Table/142316/
For more about Edge N-Grams see the documentation by Elastic here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
2. dbo.edgeNgrams8k is deterministic. For more about determinism see: https://learn.microsoft.com/en-us/sql/relational-databases/user-defined-functions/deterministic-and-nondeterministic-functions
3. If you need to sort this data without getting a sort in your execution plan you can
sort by tokenLen for ascending order, or by rightTokenIndex for descending order.
------------------------------------------------------------------------------------------
Usage Examples:
I need to turn /A/B/C/D/E into:
/A
/A/B
.....
/A/B/C/D/E
select leftToken
from dbo.edgeNgrams8k('/A/B/C/D/E')
where tokenLen % 2 = 0
------------------------------------------------------------------------------------------
History:
20171125 - Initial Development - Developed by Alan Burstein
*****************************************************************************************/
returns table with schemabinding as return
with iTally(n) as
(
select top (len(@string)) row_number() over (order by (select $))
from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a(x), -- 10^1 = 10
(values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b(x), -- 10^2 = 100
(values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) c(x), -- 10^3 = 1000
(values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d(x) -- 10^4 = 10000
)
select top (convert(bigint, len(@string), 0))
tokenlen = n,
leftToken = substring(@string,1,n),
rightTokenIndex = len(@string)+1-n,
righttoken = substring(@string,len(@string)+1-n, n)
from itally;
go
UPDATE - PERFORMANCE TESTING
To emphasize my point about why set based is the way to go, I prepared a 100K Row test.
First the recursive CTE solution, then my first solution in as inline table valued functions. You want INLINE table valued functions as they can benefit from parallel processing as I'll show you in a moment. The functions:
-- Gordon's logic as an inline table valued function
create function dbo.rCTE_GL (@string varchar(8000))
returns table as return
with x as (select @string as col),
cte as (
select col
from x
union all
select left(col, len(col) - charindex('/', reverse(col))) as col
from cte
where col like '/%/%'
)
select *
from cte;
GO
-- My logic as a table valued function
create function dbo.tally_AB(@string varchar(8000))
returns table as return
with iTally(n) as
( select top (len(@string)/2) (row_number() over (order by (select null))-1)*2+2
from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) a(x),
(values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) b(x)) -- up to 100 character
select txt = substring(reOrder.newString, 1, n)
from iTally
cross apply
(
select '/'+substring(@string,n,1)
from iTally
order by substring(@string,n,1)
for xml path('')
) reOrder(newString);
GO
The Performance test
I'm generating 100K rows with an id so we can tell where the string came from. First the rCTE solution then each of my solutions with a serial execution plan and a parallel execution plan (using trace flag 8649).
-- sample data
if object_id('tempdb..#strings') is not null drop table #strings;
create table #strings(id int identity, string varchar(20))
insert #strings(string)
select top (100000)
isnull(left(stuff(stuff(stuff(stuff(stuff('/'+
replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(
newid(),'-',''),'0',''),'9',''),'8',''),'7',''),'6',''),'5',''),'4',''),'3',''),'2',''),'1',''),
3,0,'/'),5,0,'/'),7,0,'/'),9,0,'/'),11,0,'/'),12 - (abs(checksum(newid())%2) * 2)),'/A/B/C/')
from sys.all_columns a, sys.all_columns b;
-- Performance test
set nocount on;
set statistics io, time on
PRINT char(13)+char(10)+'Gordon (unsorted)'+char(13)+char(10)+replicate('-',150);
declare @throwaway varchar(8000);
select @throwaway = col
from #strings s
cross apply dbo.rCTE_GL(s.string);
PRINT char(13)+char(10)+'Alan (sorted) - serial'+char(13)+char(10)+replicate('-',150);
select @throwaway = txt
from #strings s
cross apply dbo.tally_AB(s.string);
PRINT char(13)+char(10)+'Alan (sorted) - parallel'+char(13)+char(10)+replicate('-',150);
select @throwaway = txt
from #strings s
cross apply dbo.tally_AB(s.string)
option (querytraceon 8649);
PRINT char(13)+char(10)+'Alan (unsorted) - serial'+char(13)+char(10)+replicate('-',150);
select @throwaway = leftToken
from #strings s
cross apply dbo.edgeNgrams8k(s.string)
where tokenLen % 2 = 0;
PRINT char(13)+char(10)+'Alan (unsorted) - parallel'+char(13)+char(10)+replicate('-',150);
select @throwaway = leftToken
from #strings s
cross apply dbo.edgeNgrams8k(s.string)
where tokenLen % 2 = 0
option (querytraceon 8649);
set statistics io, time off;
GO
Results
Gordon (unsorted)
------------------------------------------------------------------------------------------------------------------------------------------------------
Table 'Worktable'. **Scan count 100001, logical reads 3492199,** physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
**CPU time = 4625 ms, elapsed time = 4721 ms.**
Alan (sorted) - serial
------------------------------------------------------------------------------------------------------------------------------------------------------
Table 'Worktable'. **Scan count 20979, logical reads 563853**, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 1, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
**CPU time = 1782 ms, elapsed time = 1790 ms.**
Alan (sorted) - parallel
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. **Scan count 20979, logical reads 563860**, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
**CPU time = 3762 ms, elapsed time = 992 ms.**
Alan (unsorted) - serial
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
**SQL Server Execution Times:
CPU time = 219 ms, elapsed time = 217 ms.
Alan (unsorted) - parallel
------------------------------------------------------------------------------------------------------------------------------------------------------
Table '#strings____________________________________________________________________________________________________________00000000004C'.
Scan count 9, logical reads 346, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
**SQL Server Execution Times:
CPU time = 393 ms, elapsed time = 101 ms.**
There you have it. Depending on what you need and how many CPUs you're using, the tally table solution got it done 4-40 times faster than the recursive CTE and with only a fraction of the reads.