3

I'm trying to create an inline table function to strip out HTML (so that I can CROSS APPLY it with another table). We have a scalar function in place, but the performance is just not good enough.

I'm really new to using T-SQL, so I don't know that many good practices of it, so if anyone has any suggestions, please let me know!

This is what I have so far:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE FUNCTION StripHTML
(   
    @text VARCHAR(MAX)
)
RETURNS TABLE 
AS
RETURN
    WITH BracketIndexes (startIndex, stopIndex) AS (SELECT
            CHARINDEX('<', @text),
            CHARINDEX('>', @text) UNION ALL SELECT
            CHARINDEX('<', @text, stopIndex + 1),
            CHARINDEX('>', @text, stopIndex + 1)
        FROM BracketIndexes
        WHERE startIndex > 0 AND stopIndex > 0
    )
    SELECT * FROM BracketIndexes

Now that I have the indices of where the <s and > are, I'm not sure how to:

  • remove the contents between those indices
  • return the string so I can cross apply it

I'm thinking I can just concat the left side and right side of each bracket using LEFT/RIGHT, but I can't seem to get the indices correct.

Edit:

This kinda worked:

DECLARE @text VARCHAR(100) = 'HELLO <B>WORLD</B> asd';

WITH BracketIndexes (startIndex, stopIndex) AS (
    SELECT
        CHARINDEX('<', @text),
        CHARINDEX('>', @text)
    UNION ALL
    SELECT
        CHARINDEX('<', @text, stopIndex + 1) - (stopIndex - startIndex + 1),
        CHARINDEX('>', @text, stopIndex + 1) - (stopIndex - startIndex + 1)
    FROM BracketIndexes
    WHERE startIndex > 0 AND stopIndex > 0
)
SELECT @text = LEFT(@text, startIndex - 1) + RIGHT(@text, LEN(@text) - stopIndex)FROM BracketIndexes WHERE startIndex > 0 AND stopIndex > 0
SELECT @text

But when when I added some more HTML statements into @text, the RIGHT had an incorrect length parameter.

2 Answers2

3

You were on the right path. With each recursion of the CTE, you want to basically replace one HTML element with a blank using a combination of REPLACE() and SUBSTRING():

DECLARE @text varchar(max) = '<a href="">Cool!</a><b>oh yeah!</b>'  

;WITH BracketIndexes (rowNumber, original, startIndex, stopIndex, replaced) AS (SELECT
        1,
        @text,
        CHARINDEX('<', @text),
        CHARINDEX('>', @text),
        REPLACE(@text,SUBSTRING(@text,CHARINDEX('<', @text),CHARINDEX('>', @text)),'') UNION ALL SELECT
        rowNumber+1,
        replaced,
        CHARINDEX('<', replaced),
        CHARINDEX('>', replaced),
        REPLACE(replaced,SUBSTRING(replaced,CHARINDEX('<', replaced),CHARINDEX('>', replaced)),'')
    FROM BracketIndexes
    WHERE startIndex > 0 AND stopIndex > 0
)
SELECT TOP 1
    replaced
FROM 
    BracketIndexes
ORDER BY RowNumber DESC
Bert Wagner
  • 851
  • 1
  • 11
  • 23
  • THANK YOU! There was a small typo though. `SUBSTRING` expects a length for the second parameter instead of an index, so I used `CHARINDEX('>', replaced) - CHARINDEX('<', replaced) + 1` instead of `CHARINDEX('>', @text)` and it worked great. –  Oct 15 '15 at 18:00
0

You can try some thing like this very simply. Ofcourse not fully tested

declare @str varchar(max)='<table><th>abc</th><tr><td>def</td></tr></table>'
select @str=replace(replace(@str,'<','~<'),'>','>~')
select * from dbo.split('~',@str) where value<>'' and SUBSTRING(value,1,1)<>'<'

Consider it as a logic only. Surely cases may fail

Binesh Nambiar C
  • 152
  • 1
  • 2
  • 9