20

I have the following records in one of my table

CD&M Communications 
auburndale oil & propane inc  
C F La Fountaine #7561  
Laramie County Fire District # 2  
AmeriGas Propane LP #2250  

Is there a way to remove the characters like &, #7561, #2250 etc.

"&" should be replaced with "&" as per C# HTMLDECODE function

Andrew Morton
  • 24,203
  • 9
  • 60
  • 84
VPP
  • 731
  • 1
  • 9
  • 34

4 Answers4

27

There is a much easier solution...

SQL Server supports the XML datatype, and it supports decoding XML/HTML encoded entities. If you just cast the string to the XML datatype, you can use the built in decode function.

That would look like this:

select cast('Q & A' as XML).value('.[1]','nvarchar(max)' );

To turn it into a function for easy use:

create function dbo.xmlDecode (@string nvarchar(max))
returns varchar(max)
begin
    return cast(@string as XML).value('.[1]','nvarchar(max)' )
end;

Keep in mind that in OP's example, the string seems to have been encoded 3 times in a row. & got turned into & then into & and then into &. As a result, to get the "original" string back, you have to use the decode function 3 times.

Wouter
  • 1,829
  • 3
  • 28
  • 34
  • 1
    One user pointed out that this conversion fails when the string contains the characters < or >. I have not been able to reproduce this issue. If anyone encounters this, I would be interested in an example case. – Wouter May 04 '18 at 09:00
  • 1
    There are 5 characters that break this solution: `<, >, &, ", '` For example: try `select cast('Q&A' as XML).value('.[1]','nvarchar(max)' )` – egerardus May 21 '19 at 00:28
  • @Geronimo: aha, thanks for that clarification. Indeed, when the string being cast to XML contains characters that are *not* XML encoded, an error is thrown. Because of this, this approach will not work for mixed-encoded strings, but only for strings that have been fully XML-encoded one or multiple times. – Wouter May 21 '19 at 09:03
  • This approach will work in most use-cases, yet the currently accepted answer by @sathish remains a good "plan B" approach for the mixed-encoding cases. – Wouter May 21 '19 at 09:05
  • 1
    I liked your use of casting because it covers the HEX entities that the accepted answer does not. I have plagarized it below and added handling for XML entities. – egerardus May 21 '19 at 16:24
  • 2
    Note that is will generally not work, as XML encoding is not the same as HTML encoding which this question is about. E.g. it will fail at `'A B'` with *"XML parsing: line 1, character 7, well formed check: undeclared entity"*. – GSerg Jun 28 '19 at 12:53
13

The following SQL function would work in your case or it would be a good starting point for you to extend it. However, please note the String manipulations in the Database [SQL Server] would be slower compared to the string manipulations in application layer.

GO

IF OBJECT_ID('dbo.MyHTMLDecode') IS NOT NULL BEGIN DROP FUNCTION dbo.MyHTMLDecode END

GO
CREATE FUNCTION dbo.MyHTMLDecode (@vcWhat VARCHAR(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @vcResult VARCHAR(MAX)
    DECLARE @siPos INT
        ,@vcEncoded VARCHAR(7)
        ,@siChar INT

    SET @vcResult = RTRIM(LTRIM(CAST(REPLACE(@vcWhat COLLATE Latin1_General_BIN, CHAR(0), '') AS VARCHAR(MAX))))

    SELECT @vcResult = REPLACE(REPLACE(@vcResult, '&#160;', ' '), '&nbsp;', ' ')

    IF @vcResult = ''
        RETURN @vcResult

    SELECT @siPos = PATINDEX('%&#[0-9][0-9][0-9];%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 6)
            ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, 3) AS INT)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
            ,@siPos = PATINDEX('%&#[0-9][0-9][0-9];%', @vcResult)
    END

    SELECT @siPos = PATINDEX('%&#[0-9][0-9][0-9][0-9];%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 7)
            ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, 4) AS INT)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
            ,@siPos = PATINDEX('%&#[0-9][0-9][0-9][0-9];%', @vcResult)
    END

    SELECT @siPos = PATINDEX('%#[0-9][0-9][0-9][0-9]%', @vcResult)

    WHILE @siPos > 0
    BEGIN
        SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, 5)
            ,@vcResult = REPLACE(@vcResult, @vcEncoded, '')
            ,@siPos = PATINDEX('%#[0-9][0-9][0-9][0-9]%', @vcResult)
    END

    SELECT @vcResult = REPLACE(REPLACE(@vcResult, NCHAR(160), ' '), CHAR(160), ' ')

    SELECT @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult, '&amp;', '&'), '&quot;', '"'), '&lt;', '<'), '&gt;', '>'), '&amp;amp;', '&')

    RETURN @vcResult
END

GO

Illustration:

  DECLARE @S VARCHAR(MAX)='CD&amp;amp;amp;M Communications 
    auburndale oil &amp;amp;amp; propane inc  
    C F La Fountaine #7561  
    Laramie County Fire District # 2  
    AmeriGas Propane LP #2250'

    SELECT dbo.MyHTMLDecode (@s)

OUTPUT:

CD&M Communications 
auburndale oil & propane inc  
C F La Fountaine   
Laramie County Fire District # 2  
AmeriGas Propane LP 
Sathish
  • 1,936
  • 4
  • 28
  • 38
  • This script worked great except it skipped some characters. I needed to add `REPLACE(@vcResult, '"', '"')`, `REPLACE(@vcResult, '&', '&')`, `REPLACE(@vcResult, '*', '*')`, `REPLACE(@vcResult, '+', '+')` and `REPLACE(@vcResult, '/', '/')`. – Tot Zam Oct 18 '17 at 16:18
  • It is not my intent to explicitly "advertise" my own answer. But make sure to have a look at it before using the above script. This accepted answer is outdated imho. – Wouter Oct 29 '18 at 08:56
  • 2
    @Wouter: Could you clarify in what sense that accepted answer is outdated? Happy to correct. – Sathish Oct 29 '18 at 17:18
  • @Sathish: Since SQL Server supports the XML decode/encode functionality, implementing this yourself has become redundant. Also, as Tot Zam mentioned, it is not as complete a solution as the SQL Server functionality, since it does not cover all XML/HTML encoded entities. It remains a great contribution, by no means I intent to critique your work, but I no longer see an added value of using this solution, over the native SQL Server functions. – Wouter Oct 30 '18 at 09:08
  • This is the better answer because it solves the general case while the XML casting does not work if what you're getting isn't perfectly encoded. Which frankly is reality for most of us. – Merennulli May 24 '23 at 16:45
9

Previous version does not works with Japan, Korean, ... Here fixed version:

GO
IF OBJECT_ID('dbo.fn_HTMLDecode') IS NOT NULL BEGIN DROP FUNCTION dbo.fn_HTMLDecode END
GO
CREATE FUNCTION dbo.fn_HTMLDecode(
    @vcWhat NVARCHAR(MAX)
    ,@toDecodeMainISOSymbols bit = 1
    ,@toDecodeISOChars bit = 1
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @vcResult NVARCHAR(MAX);
    DECLARE @siPos INT ,@vcEncoded NVARCHAR(9) ,@siChar INT;
    SET @vcResult = RTRIM(LTRIM(CAST(REPLACE(@vcWhat COLLATE Latin1_General_BIN, CHAR(0), '') AS NVARCHAR(MAX))));
    SELECT @vcResult = REPLACE(REPLACE(@vcResult, '&#160;', ' '), '&nbsp;', ' ');
    IF @vcResult = '' RETURN @vcResult;

    declare @s varchar(35);
    declare @n int; set @n = 6;
    declare @i int;

    while @n > 2
    begin
        set @s = '';
        set @i=1;
        while @i<=@n
        begin
            set @s = @s + '[0-9]';
            set @i = @i + 1;
        end
        set @s = '%&#' + @s + '%';
        SELECT @siPos = PATINDEX(@s, @vcResult);
        WHILE @siPos > 0
        BEGIN
            SELECT @vcEncoded = SUBSTRING(@vcResult, @siPos, @n+3)
                ,@siChar = CAST(SUBSTRING(@vcEncoded, 3, @n) AS INT)
                ,@vcResult = REPLACE(@vcResult, @vcEncoded, NCHAR(@siChar))
                ,@siPos = PATINDEX(@s, @vcResult);
        END
        set @n = @n - 1;
    end

    if @toDecodeMainISOSymbols=1
    begin
        select @vcResult = REPLACE(REPLACE(@vcResult, NCHAR(160), ' '), CHAR(160), ' ');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult, '&amp;', '&'), '&quot;', '"'), '&lt;', '<'), '&gt;', '>'), '&amp;amp;', '&'),'&rdquo;','”'),'&bdquo;','„'),'&ndash;','–'),'&mdash;','—');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult,'&lsquo;','‘'),'&rsquo;','’'),'&bull;','•'),'&hellip;','…'),'&permil;','‰') COLLATE Latin1_General_BIN,'&prime;','′') COLLATE Latin1_General_BIN,'&Prime;','″'),'&circ;','ˆ'),'&tilde;','˜'),'&nbsp;',' ');
    end

    if @toDecodeISOChars=1
    begin
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Scaron;','Š') COLLATE Latin1_General_BIN,'&scaron;','š') COLLATE Latin1_General_BIN,'&Ccedil;','Ç') COLLATE Latin1_General_BIN,'&ccedil;','ç');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult,'&Agrave;','À') COLLATE Latin1_General_BIN,'&agrave;','à') COLLATE Latin1_General_BIN,'&Aacute;','Á') COLLATE Latin1_General_BIN,'&aacute;','á') COLLATE Latin1_General_BIN,'&Acirc;','Â') COLLATE Latin1_General_BIN,'&acirc;','â') COLLATE Latin1_General_BIN,'&Atilde;','Ã') COLLATE Latin1_General_BIN,'&atilde;','ã') COLLATE Latin1_General_BIN,'&Auml;','Ä') COLLATE Latin1_General_BIN,'&auml;','ä') COLLATE Latin1_General_BIN,'&Aring;','Å') COLLATE Latin1_General_BIN,'&aring;','å') COLLATE Latin1_General_BIN,'&AElig;','Æ') COLLATE Latin1_General_BIN,'&aelig;','æ');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Egrave','È') COLLATE Latin1_General_BIN,'&egrave','è') COLLATE Latin1_General_BIN,'&Eacute;','É') COLLATE Latin1_General_BIN,'&eacute;','é') COLLATE Latin1_General_BIN,'&Ecirc;','Ê') COLLATE Latin1_General_BIN,'&ecirc;','ê') COLLATE Latin1_General_BIN,'&Euml;','Ë') COLLATE Latin1_General_BIN,'&euml;','ë');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Igrave;','Ì') COLLATE Latin1_General_BIN,'&igrave;','ì') COLLATE Latin1_General_BIN,'&Iacute;','Í') COLLATE Latin1_General_BIN,'&iacute;','í') COLLATE Latin1_General_BIN,'&Icirc;','Î') COLLATE Latin1_General_BIN,'&icirc;','î') COLLATE Latin1_General_BIN,'&Iuml;','Ï') COLLATE Latin1_General_BIN,'&iuml;','ï');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Ograve;','Ò') COLLATE Latin1_General_BIN,'&ograve;','ò') COLLATE Latin1_General_BIN,'&Oacute;','Ó') COLLATE Latin1_General_BIN,'&oacute;','ó') COLLATE Latin1_General_BIN,'&Ocirc;','Ô') COLLATE Latin1_General_BIN,'&ocirc;','ô') COLLATE Latin1_General_BIN,'&Otilde;','Õ') COLLATE Latin1_General_BIN,'&otilde;','õ') COLLATE Latin1_General_BIN,'&Ouml;','Ö') COLLATE Latin1_General_BIN,'&ouml;','ö') COLLATE Latin1_General_BIN,'&Oslash','Ø') COLLATE Latin1_General_BIN,'&oslash','ø');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&Ugrave;','Ù') COLLATE Latin1_General_BIN,'&ugrave;','ù') COLLATE Latin1_General_BIN,'&Uacute;','Ú') COLLATE Latin1_General_BIN,'&uacute;','ú') COLLATE Latin1_General_BIN,'&Ucirc;','Û') COLLATE Latin1_General_BIN,'&ucirc;','û') COLLATE Latin1_General_BIN,'&Uuml;','Ü') COLLATE Latin1_General_BIN,'&uuml;','ü');
        select @vcResult = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@vcResult COLLATE Latin1_General_BIN,'&ETH;','Ð') COLLATE Latin1_General_BIN,'&eth;','ð') COLLATE Latin1_General_BIN,'&Ntilde;','Ñ') COLLATE Latin1_General_BIN,'&ntilde;','ñ') COLLATE Latin1_General_BIN,'&Yacute;','Ý') COLLATE Latin1_General_BIN,'&yacute;','ý') COLLATE Latin1_General_BIN,'&THORN;','Þ') COLLATE Latin1_General_BIN,'&thorn;','þ') COLLATE Latin1_General_BIN,'&szlig;','ß');
    end
    RETURN @vcResult;
END
-- test:
-- select dbo.fn_HTMLDecode(N'A fine example of man and nature co-existing is Slovenia&#8217;s ecological tourist farms.',1,1)
-- select dbo.fn_HTMLDecode(N'm0 &#50752;&#51064;&#48516;&#50556;&#50640;&#49436; m1 &#44032;&#51109; m2 &#50689;&#54693;&#47141; m3&#51080;&#45716;m10',1,1)
QazyCat
  • 129
  • 1
  • 2
3

Using the casting method from @Wouter handles decimal and hex entities, which the accepted answer does not.

However, a handler is also needed to deal with XML special characters in the input.

XML special characters: <, >, ", ', &

create function [dbo].[XMLdecode] (@input nvarchar(max))
returns nvarchar(max)
begin

    declare @output nvarchar(max) = ''
    declare @next nchar(1)
    declare @endIdx int = 0
    declare @idx int = 0
    while @idx < len(@input)
    begin

        set @idx += 1
        set @next = substring(@input, @idx, 1)
        set @endIdx = charindex(';', @input, @idx) - @idx

        if @next = '&' and (@endIdx > 8 or @endIdx < 1) 
            set @output += '&amp;'
        else if @next = '&' and @endIdx > 1 and @endIdx < 8
        begin
            set @output += lower(substring(@input,@idx,@endIdx+1))
            set @idx += @endIdx
        end
        else
            set @output += @next
    end

    set @output = replace(@output,'<','&lt;')
    set @output = replace(@output,'>','&gt;')
    set @output = replace(@output,'"','&quot;')
    set @output = replace(@output,'''','&apos;')

    return cast(@output as XML).value('.[1]','nvarchar(max)')

end;

Example usage:

select dbo.XMLdecode('this is a t&#xc9;st &#xba;f HEX &amp; DECIMAL &#8364;ntities & <<<< non-entities too! &#9745;')
------------------------------------------------------------------------------
returns: 'this is a tÉst ºf HEX & DECIMAL €ntities & <<<< non-entities too! ☑'

It's not bulletproof, but handled all of my cases.

Something that looks like an HTML entity but which is not actually, can still cause trouble. For example: &#x2; or even &;

It also does not handle non-XML named entities like &check; or &hearts;

egerardus
  • 11,316
  • 12
  • 80
  • 123