I have a field declared as NVARCHAR(MAX) that I pull from a table. In most cases, the length is less than 1,000 characters. However, the field can contain some HTML text, so I use a scalar function that I wrote to strip out the HTML text. On really long strings (e.g., 600,000 characters), the call to strip the HTML text just sits there (I assume because that's a heck of a lot of data to pass through a scalar function). It will eventually return--about 15 minutes later, because it takes a long time to make the scalar call.
Here's the Table:
CREATE TABLE [dbo].[TextHolder](
[fldClaimTextID] [int] IDENTITY(1,1) NOT NULL,
[fldText] [nvarchar](max) NULL,
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Assume the table has data (mine has 250 million rows in it). 99.99% of the instances of fldText have less than 1,000 characters.
So a call to a basic scalar function works like a charm 99.99% of the time, using the following query:
SELECT
dbo.udf_StripHTML(fldText)
,LEN(fldText) AS fldLength
FROM
TextHolder
Here's a sampling of data from 23 records out of my 250 million or so where two results (rows 8 and 18) are way long.
fldLength
285
459
132
141
137
187
129
619182
173
327
433
643
132
141
136
187
129
690123
465
428
237
243
178
So, my only solution now is to truncate the call to udf_StripHTML, which is sub-optimal. See the example below where I only strip HTML when the field is less than 20,000 characters long):
SELECT
CASE WHEN LEN(fldText) > 20000 THEN fldText ELSE dbo.udf_StripHTML(fldText) END
,LEN(fldText) AS fldLength
FROM
TextHolder
How can I pass an nvarchar that is 600,000 characters long to a scalar function without it taking a long time.
Side note, I'm running on a 4 processor box with 256GB of memory, all dedicated to SQL Server, tied on a 20GB/s SAN, so I have no practical hardware limitations.
I appreciate any thoughts. Thanks!
UPDATE:
Here's the function:
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @strippedString nvarchar(MAX) = N''
DECLARE @currentChar nvarchar
DECLARE @ignoreCharacter bit = 0
-- Replace any <br> tags with a newline characters
SET @Start = 0
SET @End = LEN(@HTMLText)
-- This is an incremental algorithm that traverses the string only once, which means that it should be fast. It basically starts at the 0th character, going through
-- the entire string. When it finds a starting "<", it sets a flag to ignore all characters until it finds a closing ">" (if it never finds one, it truncates the rest
-- of the string. We only add characters to the return string when the ignore flag is turned off.
WHILE @Start < @End
BEGIN
SET @currentChar = SUBSTRING(@HTMLText,@Start,1)
-- Found a starting bracket, so turn on the ignore flag
IF @currentChar = N'<'
SET @ignoreCharacter = 1
-- Found an ending bracket, so turn off the ignore flag
ELSE IF @currentChar = N'>'
SET @ignoreCharacter = 0
ELSE
-- If we have a non-bracket character and the ignore flag is off, then add the character to our return value
IF @ignoreCharacter = 0
SET @strippedString = @strippedString + @currentChar
SET @Start = @Start + 1
END
RETURN @strippedString
END