0

I have a field declared as NVARCHAR(MAX) that I pull from a table. In most cases, the length is less than 1,000 characters. However, the field can contain some HTML text, so I use a scalar function that I wrote to strip out the HTML text. On really long strings (e.g., 600,000 characters), the call to strip the HTML text just sits there (I assume because that's a heck of a lot of data to pass through a scalar function). It will eventually return--about 15 minutes later, because it takes a long time to make the scalar call.

Here's the Table:

CREATE TABLE [dbo].[TextHolder](
    [fldClaimTextID] [int] IDENTITY(1,1) NOT NULL,
    [fldText] [nvarchar](max) NULL,
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Assume the table has data (mine has 250 million rows in it). 99.99% of the instances of fldText have less than 1,000 characters.

So a call to a basic scalar function works like a charm 99.99% of the time, using the following query:

SELECT
    dbo.udf_StripHTML(fldText)
    ,LEN(fldText) AS fldLength
FROM
    TextHolder

Here's a sampling of data from 23 records out of my 250 million or so where two results (rows 8 and 18) are way long.

fldLength
285
459
132
141
137
187
129
619182
173
327
433
643
132
141
136
187
129
690123
465
428
237
243
178

So, my only solution now is to truncate the call to udf_StripHTML, which is sub-optimal. See the example below where I only strip HTML when the field is less than 20,000 characters long):

SELECT
    CASE WHEN LEN(fldText) > 20000 THEN fldText ELSE dbo.udf_StripHTML(fldText) END
    ,LEN(fldText) AS fldLength
FROM
    TextHolder

How can I pass an nvarchar that is 600,000 characters long to a scalar function without it taking a long time.

Side note, I'm running on a 4 processor box with 256GB of memory, all dedicated to SQL Server, tied on a 20GB/s SAN, so I have no practical hardware limitations.

I appreciate any thoughts. Thanks!

UPDATE:

Here's the function:

ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @strippedString nvarchar(MAX) = N''
DECLARE @currentChar nvarchar
DECLARE @ignoreCharacter bit = 0

-- Replace any <br> tags with a newline characters
SET @Start = 0
SET @End = LEN(@HTMLText)

-- This is an incremental algorithm that traverses the string only once, which means that it should be fast. It basically starts at the 0th character, going through
-- the entire string. When it finds a starting "<", it sets a flag to ignore all characters until it finds a closing ">" (if it never finds one, it truncates the rest
-- of the string. We only add characters to the return string when the ignore flag is turned off.
WHILE @Start < @End
    BEGIN
        SET @currentChar = SUBSTRING(@HTMLText,@Start,1)

        -- Found a starting bracket, so turn on the ignore flag
        IF @currentChar = N'<'
            SET @ignoreCharacter = 1
        -- Found an ending bracket, so turn off the ignore flag
        ELSE IF @currentChar = N'>'
            SET @ignoreCharacter = 0
        ELSE
            -- If we have a non-bracket character and the ignore flag is off, then add the character to our return value
            IF @ignoreCharacter = 0
                SET @strippedString = @strippedString + @currentChar

        SET @Start = @Start + 1
    END

RETURN @strippedString

END
Mike P.
  • 1
  • 3
  • The *passing* of the argument to the scalar function is unlikely to be a problem (although in all honesty, I have only passed string values with up to a couple hundred thousand bytes in SQL Server). The problem is the underlying processing of the function, so you need to include the body of the function in the question. – Gordon Linoff Apr 03 '16 at 03:24
  • If the function is the bottle neck you might want to try optimizing it. You could also store the stripped version of text. – Vincent Apr 03 '16 at 03:37
  • It seems to work okay with 200,000 characters or so. Around 600,000, things got bad.I thought about storing the stripped version, but the total data is about 190GB. – Mike P. Apr 03 '16 at 03:59
  • 2
    a better alternative would be to do this in the presentation layer. you can strip all html tags, including script blocks and comments in a single, fairly simple regular expression. that would take a fraction of the time it takes for sql server to loop through the string of 600,000 chars. Another option is to use a clr function that can also take advantage of the .net framework to use regular expressions. – Zohar Peled Apr 03 '16 at 05:29
  • Your T-SQL function may be not `O(n)` as you think, but `O(n*n)`. As a result, it is slow with large strings. `SET @strippedString = @strippedString + @currentChar` may be quadratic. `SUBSTRING(@HTMLText,@Start,1)` may also be quadratic. I'd write a CLR function or do this processing outside SQL Server altogether. – Vladimir Baranov Apr 03 '16 at 11:12
  • Zohar, it's a data mining and processing issue I'm dealing with, not a presentation layer issue that I'm dealing with. Unfortunately, moving the operation to the presentation layer does not work. Vladimir, I'm with you--I wrote a CLR function in the meantime to test, which is of course rocket-fast. Thanks both for the comments/thoughts. – Mike P. Apr 03 '16 at 16:38
  • FYI, the CLR function executed in under 2 seconds. Vladimir, your suspicions on O(n) timing have empirical roots from my perspective. Thanks! – Mike P. Apr 04 '16 at 02:45

2 Answers2

0

unfortunately there is not equivalent to oracle's regexp_replace in mssql server. so there are two workarounds. This one here as it was answered before. it is not ideal but I think the performance will improve. Or this which is a way to bring the .net regular expressions to SQL which I think is easy and will also improve performance significantly but maintenance can get messy.

Community
  • 1
  • 1
Gubr
  • 324
  • 1
  • 7
  • 15
  • I wound up writing a CLR function, but thanks for the link to the Regular Expression project. Good stuff! – Mike P. Apr 04 '16 at 12:49
0

A friend of mine also have another suggestion that might improve your current solution. he suggests that you split your string with character '<' then remove the first part of all the entries to the character '>' the join the list again.

Gubr
  • 324
  • 1
  • 7
  • 15