Longest NVARCHAR Value Passed To Scalar Function

Question

I have a field declared as NVARCHAR(MAX) that I pull from a table. In most cases, the length is less than 1,000 characters. However, the field can contain some HTML text, so I use a scalar function that I wrote to strip out the HTML text. On really long strings (e.g., 600,000 characters), the call to strip the HTML text just sits there (I assume because that's a heck of a lot of data to pass through a scalar function). It will eventually return--about 15 minutes later, because it takes a long time to make the scalar call.

Here's the Table:

CREATE TABLE [dbo].[TextHolder](
    [fldClaimTextID] [int] IDENTITY(1,1) NOT NULL,
    [fldText] [nvarchar](max) NULL,
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

Assume the table has data (mine has 250 million rows in it). 99.99% of the instances of fldText have less than 1,000 characters.

So a call to a basic scalar function works like a charm 99.99% of the time, using the following query:

SELECT
    dbo.udf_StripHTML(fldText)
    ,LEN(fldText) AS fldLength
FROM
    TextHolder

Here's a sampling of data from 23 records out of my 250 million or so where two results (rows 8 and 18) are way long.

So, my only solution now is to truncate the call to udf_StripHTML, which is sub-optimal. See the example below where I only strip HTML when the field is less than 20,000 characters long):

SELECT
    CASE WHEN LEN(fldText) > 20000 THEN fldText ELSE dbo.udf_StripHTML(fldText) END
    ,LEN(fldText) AS fldLength
FROM
    TextHolder

How can I pass an nvarchar that is 600,000 characters long to a scalar function without it taking a long time.

Side note, I'm running on a 4 processor box with 256GB of memory, all dedicated to SQL Server, tied on a 20GB/s SAN, so I have no practical hardware limitations.

I appreciate any thoughts. Thanks!

UPDATE:

Here's the function:

ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @strippedString nvarchar(MAX) = N''
DECLARE @currentChar nvarchar
DECLARE @ignoreCharacter bit = 0

-- Replace any <br> tags with a newline characters
SET @Start = 0
SET @End = LEN(@HTMLText)

-- This is an incremental algorithm that traverses the string only once, which means that it should be fast. It basically starts at the 0th character, going through
-- the entire string. When it finds a starting "<", it sets a flag to ignore all characters until it finds a closing ">" (if it never finds one, it truncates the rest
-- of the string. We only add characters to the return string when the ignore flag is turned off.
WHILE @Start < @End
    BEGIN
        SET @currentChar = SUBSTRING(@HTMLText,@Start,1)

        -- Found a starting bracket, so turn on the ignore flag
        IF @currentChar = N'<'
            SET @ignoreCharacter = 1
        -- Found an ending bracket, so turn off the ignore flag
        ELSE IF @currentChar = N'>'
            SET @ignoreCharacter = 0
        ELSE
            -- If we have a non-bracket character and the ignore flag is off, then add the character to our return value
            IF @ignoreCharacter = 0
                SET @strippedString = @strippedString + @currentChar

        SET @Start = @Start + 1
    END

RETURN @strippedString

END

The *passing* of the argument to the scalar function is unlikely to be a problem (although in all honesty, I have only passed string values with up to a couple hundred thousand bytes in SQL Server). The problem is the underlying processing of the function, so you need to include the body of the function in the question. — Gordon Linoff, Apr 03 '16 at 03:24
If the function is the bottle neck you might want to try optimizing it. You could also store the stripped version of text. — Vincent, Apr 03 '16 at 03:37
It seems to work okay with 200,000 characters or so. Around 600,000, things got bad.I thought about storing the stripped version, but the total data is about 190GB. — Mike P., Apr 03 '16 at 03:59
a better alternative would be to do this in the presentation layer. you can strip all html tags, including script blocks and comments in a single, fairly simple regular expression. that would take a fraction of the time it takes for sql server to loop through the string of 600,000 chars. Another option is to use a clr function that can also take advantage of the .net framework to use regular expressions. — Zohar Peled, Apr 03 '16 at 05:29
Your T-SQL function may be not `O(n)` as you think, but `O(n*n)`. As a result, it is slow with large strings. `SET @strippedString = @strippedString + @currentChar` may be quadratic. `SUBSTRING(@HTMLText,@Start,1)` may also be quadratic. I'd write a CLR function or do this processing outside SQL Server altogether. — Vladimir Baranov, Apr 03 '16 at 11:12
Zohar, it's a data mining and processing issue I'm dealing with, not a presentation layer issue that I'm dealing with. Unfortunately, moving the operation to the presentation layer does not work. Vladimir, I'm with you--I wrote a CLR function in the meantime to test, which is of course rocket-fast. Thanks both for the comments/thoughts. — Mike P., Apr 03 '16 at 16:38
FYI, the CLR function executed in under 2 seconds. Vladimir, your suspicions on O(n) timing have empirical roots from my perspective. Thanks! — Mike P., Apr 04 '16 at 02:45

score 0 · Answer 1 · edited May 23 '17 at 11:45

0

unfortunately there is not equivalent to oracle's regexp_replace in mssql server. so there are two workarounds. This one here as it was answered before. it is not ideal but I think the performance will improve. Or this which is a way to bring the .net regular expressions to SQL which I think is easy and will also improve performance significantly but maintenance can get messy.

edited May 23 '17 at 11:45

Community

1
1

answered Apr 04 '16 at 06:00

Gubr

324
1
7
15

I wound up writing a CLR function, but thanks for the link to the Regular Expression project. Good stuff! – Mike P. Apr 04 '16 at 12:49

score 0 · Answer 2 · answered Apr 04 '16 at 06:39

0

A friend of mine also have another suggestion that might improve your current solution. he suggests that you split your string with character '<' then remove the first part of all the entries to the character '>' the join the list again.

answered Apr 04 '16 at 06:39

Gubr

324
1
7
15

Longest NVARCHAR Value Passed To Scalar Function

2 Answers2