1

I'm using SQL Server 2012, and I have a nvarchar that has XML-escaped representations of UTF characters because they are present in a filename (most notably \uFFFF).

How can I run through a nvarchar and replace all these occurrences with their actual Unicode characters?

This works, but I'd have to enumerate each one, and this is not really feasible (Invalid: [#x0-#x8]|#xB|#xC|[#xE-#x1F]|[#x7F-#x84]|[#x86-#x9F]|[#xD800-#xDFFF]|[#xFDD0-#xFDEF]|#xFFFE|#xFFFF|[#x10FFFF-Up])

DECLARE @string nvarchar(2000)
SET @string = N''
SELECT @string,
REPLACE(REPLACE(REPLACE(
REPLACE(REPLACE(REPLACE(
    @string, 
'', NCHAR(0xFFFF)), '', NCHAR(0xFFFE)), '�', NCHAR(0x0000)),
'', NCHAR(65535)), '', NCHAR(65534)), '�', NCHAR(0))

How can I replace all of them at once? (&#x******; => NCHAR(0x******) and &#******; => NCHAR(******)), ideally without a CLR function.

Ehryk
  • 1,930
  • 2
  • 27
  • 47
  • The reason for doing this is to perform a bulk-insert of filenames into a SQL Server table. For bulk-inserts, XML is can be passed directly to a stored procedure and SS can natively parse it, but XML cannot contain all of the characters that are valid in Windows filenames and C# strings (one example being `\uFFFF`). – Ehryk Mar 13 '15 at 20:44

2 Answers2

1

There doesn't seem to be much sense in translating invalid sequences such as &#xFFFF, but if you strip the invalid ones out via REPLACE, you can translate in one shot the valid ones by converting to XML and then back to NVARCHAR again:

DECLARE @string NVARCHAR(2000);
SET @string = N'<test>&#xF0F0;&#5535;</test>';
SELECT @string AS [Original],
       CONVERT(XML, @string) AS [ConvertedToXml],
       CONVERT(NVARCHAR(2000), CONVERT(XML, @string)) AS [ConvertedToXmlThenToNVarChar]

Returns:

Original                        ConvertedToXML     ConvertedToXmlThenToNVarChar
<test>&#xF0F0;&#5535;</test>    <test>ᖟ</test>    <test>ᖟ</test>

But this will error if you don't first replace with invalid sequences with '' (empty string).


UPDATE:

Relating to the other question on this topic ( How do I properly handle &#xFFFF; in UTF-8 XML? ), you could do something similar to the following which translates invalid characters to custom escape sequences which can be converted on the way out:

DECLARE @Original NVARCHAR(2000),
        @TempXml XML,
        @StoredAsNVarChar NVARCHAR(2000),
        @Extracted NVARCHAR(2000);

SET @Original = N'<FileName>&#xF0F0;풜〣&#xFFFF;&#xFFFF;</FileName>';
SET @Original = REPLACE(@Original, N'&#xFFFF;', N'\uFFFF;');

SET @TempXml = CONVERT(XML, @Original);

SET @StoredAsNVarChar = CONVERT(NVARCHAR(2000), @TempXml);

SET @Extracted = REPLACE(@StoredAsNVarChar, N'\uFFFF;', NCHAR(65535));

SELECT @Original AS [OriginalAfterTranslatingInvalidCharacters],
       @TempXml AS [ConvertedOriginalToXml],
       @StoredAsNVarChar AS [ConvertedXmlBackToNVarChar],
       @Extracted AS [ExtractedAndTranslatedBackToInvalidCharacters];

Still, I would suggest attempting to first rename the files such that they do not have invalid Unicode characters in them to begin with, and then import into SQL Server. I can't imagine that this will be the only problem you will face with these files given the invalid characters in the filename. And as I mentioned in the other question, are you sure that the name isn't somehow being misreported by PowerShell? Are you able to write a small console app in C# or VB.Net that can make use of DirectoryInfo to list the files in a directory?


UPDATE 2:

Based on the discussion (in comments) in the other question (linked above), it is now understood that the need for this to be answered relates to using XML as a transport mechanism to bulk insert file info. While XML can be used to send in an array of data for this purpose, a more efficient means is Table-Valued Parameters (introduced in SQL Server 2008), which is a strongly-typed collection that appears as a Table Variable on the database side. I provide an example of doing this properly (using a DataTable, which most people tend to use, is not the proper way) in the following answer:

Pass Dictionary<string,int> to Stored Procedure T-SQL

And there is a link in that answer to another answer where I provided another example of using a TVP to stream data into SQL Server.

Community
  • 1
  • 1
Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • My whole point in doing this is to retain "invalid" XML Entities - specifically ``. This can't be stored in an nvarchar as a unicode character (\uFFFF)? – Ehryk Mar 13 '15 at 05:54
  • I have a file on a Windows file system, and a database of the files. The file's actual name as reported by PowerShell's Get-ChildItem is `풜〣` (shows up in cmd dir as `????`, there are four characters in it). See: http://stackoverflow.com/questions/29022169/how-do-i-properly-handle-xffff-in-utf-8-xml – Ehryk Mar 13 '15 at 05:57
  • @Ehryk yes, it can be stored as `NVARCHAR`, but code-point 65535 / xFFFF is still not a valid character. I will update with another suggestion, reflecting what is discussed in your other [question](http://stackoverflow.com/questions/29022169/how-do-i-properly-handle-xffff-in-utf-8-xml/29036559#29036559) on this topic. – Solomon Rutzky Mar 13 '15 at 15:54
  • Could you help me out with some terminology on this? What's a word for 'characters + non-characters', both of which can clearly be used in windows filenames and have Unicode definitions? I've been using 'character' for these instances, and don't know what else to call 'character-like things' in filenames/Unicode. – Ehryk Mar 13 '15 at 16:07
  • @Ehryk Check out "Update 2" in my answer for an alternative to the XML method of passing in an array. – Solomon Rutzky Mar 13 '15 at 22:15
  • Thanks, I will definately look in to TVP now that I understand the limitations of XML regarding certain Unicode code-points. Is 'code-points' a better syntax for Unicode characters+non-characters? I'm having a rough time getting out of using characters because: 1. a filename is made of 'characters' in colloquial speech, 2. Unicode non-characters are represented by SQL Server `NCHAR` data types and can be returned by `NCHAR(0xFFFF)` making the Unicode "non-character" a character by definition in SQL Server, 3. a C# string can also have them, and is composed of a character array. – Ehryk Mar 13 '15 at 23:01
  • Thus, I can't seem to escape `\uFFFF` being a _character_ in the context of C# strings, SQL Server NVARCHARs, and filenames, whether or not Unicode considers it a "non-character" it is still a character. A _non-character character_? – Ehryk Mar 13 '15 at 23:02
  • Also, I did verify with PowerShell's `Get-ChildItem` and `cmd /U` and a C# console application that this is the actual file's filename. – Ehryk Mar 13 '15 at 23:54
  • @Ehryk Good to know that it is what it is reported as being. Unfortunate that whatever person or software generated that filename didn't know to not use those characters. Oh well, too late now ;-). But this definitely points in the direction of using the TVP method. It is more efficient than XML anyway. – Solomon Rutzky Mar 14 '15 at 01:39
0

I was looking for a generic solution that handled all XML Escape sequences, invalid and valid alike, in a NVARCHAR and replaced them with the character values. I wrote this up as I could not find an example that did not involve enumerating each one with then thousands REPLACE statements (at least 65535), and got it working. If there is a better way to do it (non-CLR), I'd like to know about it.

/* Converts '0F0F' to 0x0F0F */
CREATE FUNCTION dbo.GetHex(@input nvarchar(max))
RETURNS varbinary(max)
AS
BEGIN
    RETURN CONVERT(varbinary(max), @input, 2)
END

GO

/* Converts '0F0F' to the nchar \u0F0F */
CREATE FUNCTION dbo.GetNcharHex(@input nvarchar(max))
RETURNS nchar(1)
AS
BEGIN
    RETURN NCHAR(dbo.GetHex(@input))
END

GO

/* Converts '123' to the nchar \u7B (decimal 123 in hex) */
CREATE FUNCTION dbo.GetNcharDec(@input nvarchar(max))
RETURNS nchar(1)
AS
BEGIN
    RETURN NCHAR(CONVERT(int, @input))
END

GO

/* Replaces a group of @n nchars inside @prefix and @suffix with the of the @n nchars interpreted as hexadecimal; if @prefix = '&#x', @suffix = ';', and @n = 2 then it will replace all 2 digit hex XML entities: e.g. '&#x52;' with 'R' */
CREATE FUNCTION dbo.ReplaceNGroupsHex(@input nvarchar(max), @prefix nvarchar(max), @group nvarchar(max), @suffix nvarchar(max), @n int = 2)
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @pattern nvarchar(max), @location int
    SET @pattern = '%' + @prefix + REPLICATE(@group, @n) + @suffix + '%'

    WHILE(1=1)
    BEGIN
        SET @location = PATINDEX(@pattern, @input)
        IF (@location = 0) BREAK
        SET @input = REPLACE(@input, SUBSTRING(@input, @location, LEN(@prefix) + @n + LEN(@suffix)), dbo.GetNcharHex(SUBSTRING(@input, @location + LEN(@prefix), @n)))
    END
    RETURN @input
END

GO

/* Replaces a group of @n nchars inside @prefix and @suffix with the of the @n nchars interpreted as decimal; if @prefix = '&#', @suffix = ';', and @n = 2 then it will replace all 2 digit decimal XML entities: e.g. '&#33;' with '!' */
CREATE FUNCTION dbo.ReplaceNGroupsDec(@input nvarchar(max), @prefix nvarchar(max), @group nvarchar(max), @suffix nvarchar(max), @n int = 1)
RETURNS nvarchar(max)
AS
BEGIN
    DECLARE @pattern nvarchar(max), @location int
    SET @pattern = '%' + @prefix + REPLICATE(@group, @n) + @suffix + '%'

    WHILE(1=1)
    BEGIN
        SET @location = PATINDEX(@pattern, @input)
        IF (@location = 0) BREAK
        SET @input = REPLACE(@input, SUBSTRING(@input, @location, LEN(@prefix) + @n + LEN(@suffix)), dbo.GetNcharDec(SUBSTRING(@input, @location + LEN(@prefix), @n)))
    END
    RETURN @input
END

GO

/* Replaces all Hexadecimal XML entities of @n length or lower (grouped in pairs); @n = 4 will result in '&#x0041;&#x41;' => 'AA', and @n = 2 will result in '&#x0041;&#x41;' => '&#x0041;A' */
CREATE FUNCTION dbo.ReplaceHexEntities(@input nvarchar(max), @n int = 4)
RETURNS nvarchar(max)
AS
BEGIN
    WHILE(@n > 0)
    BEGIN
        SET @input = dbo.ReplaceNGroupsHex(@input, N'&#x', '[0-9,A-F,a-f]', ';', @n)
        SET @n = @n - 2
    END
    RETURN @input
END

GO

/* Replaces all Decimal XML entities of @n length or lower; @n = 5 will result in '&#65514;&#74;' => '↑J', and @n = 2 will result in '&#118;&#70;' => '&#118;F' */
CREATE FUNCTION dbo.ReplaceDecEntities(@input nvarchar(max), @n int = 5)
RETURNS nvarchar(max)
AS
BEGIN
    WHILE(@n > 0)
    BEGIN
        SET @input = dbo.ReplaceNGroupsDec(@input, N'&#', '[0-9]', ';', @n)
        SET @n = @n - 1
    END
    RETURN @input
END

GO

/* Replaces all XML Entities up to: \uFFFF (Hex) and \u1869F (Decimal 99999) */
CREATE FUNCTION dbo.ReplaceEntities(@input nvarchar(max))
RETURNS nvarchar(max)
AS
BEGIN
    SET @input = dbo.ReplaceHexEntities(@input, DEFAULT)
    SET @input = dbo.ReplaceDecEntities(@input, DEFAULT)
    RETURN @input
END

GO

SELECT dbo.GetHex('00FFAA')
SELECT dbo.ReplaceNGroupsDec(N'z&#xFFFF;&#x0042;&#65535;', '&#', '[0-9]', ';', 5)
SELECT dbo.ReplaceHexEntities(N'z&#xFFFF;&#x0042;&#65535;', DEFAULT)
SELECT dbo.ReplaceDecEntities(N'z&#xFFFF;&#x0042;&#65535;&#69;', DEFAULT)
SELECT dbo.ReplaceEntities(N'z&#xFFFF;&#x0042;&#65535;&#69;'), LEN(dbo.ReplaceEntities(N'z&#xFFFF;&#x0042;&#65535;&#69;'))
Ehryk
  • 1,930
  • 2
  • 27
  • 47