2

I'm having some issues representing , a valid UTF-16 construct and also apparently valid in a Windows filename, in XML for consumption in SQL Server XML (2012).

Take this for example:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 41, illegal xml character

This is, however, legal Unicode (a "Unicode Noncharacter" ''): http://www.fileformat.info/info/unicode/char/ffff/index.htm

So, I tried this:

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0" encoding="utf-16"?><FileName>풜〣&#xFFFF;&#xFFFF;</FileName>'

-- Result: XML parsing: line 1, character 59, illegal xml character

So - how am I supposed to represent this file name accurately in XML? I can't just remove the characters, they are indeed &#xFFFF; characters as reported by Get-ChildItem, and I will need to retain a handle to this file.

I can get the XML parsing to work by replacing &#xFFFF; with &#xEF;&#xBF;&#xBF; which is the UTF-8 representation of \uFFFF according to this link I then try to take this XML and insert it to a nvarchar column, and I need this to be the proper representation of the filename.

DECLARE @Xml xml;
SET @Xml = N'<?xml version="1.0"?><FileName>풜〣&#xEF;&#xBF;&#xBF;&#xEF;&#xBF;&#xBF;</FileName>'
SELECT F.Item.value('.', 'nvarchar(2000)') FROM @Xml.nodes('//FileName') as F(Item)

-- Returns 풜〣ï¿¿ï¿¿ (not correct)
Community
  • 1
  • 1
Ehryk
  • 1,930
  • 2
  • 27
  • 47

2 Answers2

2

The set of characters allowed in XML documents is defined by production 2 of the XML specification. It excludes U+FFFF (which is defined as a non-character by Unicode, and which at the time XML was being developed was not allowed in information interchange using Unicode).

That means you cannot represent U+FFFF literally in an XML document, or using an XML numeric character reference. You can of course invent your own escape mechanism, or use something like URI escaping, to encode the character in your data; before inserting the data into applications that allow U+FFFF, you will of course have to unescape your representation.

I wonder why non-characters are allowed in Windows file names.

C. M. Sperberg-McQueen
  • 24,596
  • 5
  • 38
  • 65
  • I only really need to get the conversion to nvarchar in the SQL statement at the end to return the proper nvarchar representation of those characters - do you have any suggestions as far as how to do this most efficiently? – Ehryk Mar 12 '15 at 23:38
  • I.E. how do I get from `FFFF` to the proper unicode character in an nvarchar? – Ehryk Mar 12 '15 at 23:48
  • I'd try the SQL Server function `replace( x, '\uFFFF', nchar(65536))`, where `x` is replaced by a reference to the string value of the appropriate XML element and `\uFFFF` is replaced by whatever you decide to use as an escape sequence for the character. (Actually, no. It's unlikely that U+FFFF is the only problematic character that's going to turn up, so if it were me, I'd use my language's URI escaping function to create the string value of the XML element, and the corresponding de-escaping function to create the string needed for the SQL column. But YMMV.) – C. M. Sperberg-McQueen Mar 13 '15 at 00:37
  • I found `NCHAR`: `NCHAR(0xFFFF)` or `NCHAR(CONVERT(varbinary(20), '0x' + @string, 1))` – Ehryk Mar 13 '15 at 00:41
  • What's worse in all this is that you can actually make a windows filename `.txt` or `.txt`, which means I'll have to somehow distinguish the files with these characters in them vs. files that have the XML Entity definition in them. – Ehryk Mar 13 '15 at 16:26
  • Yes; that's why I'd use a general purpose escape mechanism like URI escaping. – C. M. Sperberg-McQueen Mar 13 '15 at 18:10
1

&#xFFFF; (i.e. decimal 65535) is not a legal character, even according to the link provided in the question that states "This is, however, legal UTF-16". That link shows it to be illegal, non-character, and cannot be represented in any way (according to their test pages).

Also, according to Unicode.org:

Noncharacters
These codes are intended for process-internal uses.

FFFE <not a character>
• may be used to detect byte order by contrast with FEFF 
→ FEFF = zero width no-break space

FFFF <not a character>

And, according to W3C, the list of valid characters is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */


In order to get this into XML (at least in terms of the SQL Server XML datatype), you need to first replace any occurrences of &#xFFFE; and &#xFFFF; with a custom escape sequence, such as \uFFFE; and \uFFFF; resectively. Then, when converting back to NVARCHAR you can simply replace \uFFFE; with NCHAR(65534) and \uFFFF; with NCHAR(65535) respectively.

OR, you can Base64 Encode the values (rather easy on the app code side) and decode on the way out. If you need to access this on the database side, you can create your own SQLCLR functions to Base64 Encode and Decode, or just grab the Free version of the SQL# library (which I am the author of) which includes Convert_ToBase64 and Convert_FromBase64 and can be used as follows:

DECLARE @Encoded NVARCHAR(200),
        @Decoded NVARCHAR(200);

SET @Encoded =
    SQL#.Convert_ToBase64(CONVERT(VARBINARY(200), N'f' + NCHAR(65535) + N'g'), 'None');

SELECT CONVERT(XML, N'<test>' + @Encoded + N'</test>');

SET @Decoded = SQL#.Convert_FromBase64(@Encoded);
SELECT @Encoded AS [Encoded],
       @Decoded AS [Decoded],
       DATALENGTH(@Decoded) AS [NumBytes], -- 6 bytes = 3 characters (most of the time)
       UNICODE(SUBSTRING(@Decoded, 2, 1)) AS [TaDa!] -- get value of middle character

Returns:

<test>ZgD//2cA</test>

and then:

Encoded     Decoded     NumBytes    TaDa!
ZgD//2cA    fg          6           65535

It seems that it should not be used as part of a filename (I know that is not your doing), or perhaps it is really a valid character that is being read incorrectly by Get-ChildItem.

I will need to retain a handle to this file.

Is it possible to rename the file to remove the invalid characters prior to importing the names/info into SQL Server? Just a thought.


And just FYI, you are not allowed to change the encoding via the xml declaration, at least not so easily: SQL Server 2008R2 and creating XML document

Community
  • 1
  • 1
Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
  • The link shows it to be a valid, defined, Unicode non-character represented in UTF-16 as `` (my intent by the above statement). I cannot rename the files that are being cataloged and `\uFFFF` is a valid, though discouraged, Unicode character in a windows/NTFS filename or directory. – Ehryk Mar 13 '15 at 15:55
  • @Ehryk I am not trying to be argumentative, but that link specifically says that it is a "Noncharacter" and the graphical representation of it on that page even says "Illegal" at the top and "Not Unicode" at the bottom. Not sure what else to make of it. Regardless, it will never be accepted in XML. – Solomon Rutzky Mar 13 '15 at 16:03
  • I can't find what you're talking about - I don't see 'illegal' on any of the links here, and how can it be "not Unicode" if it is defined in Unicode as a 'Unicode Non-Character'? Can you copy the link you are referencing here? – Ehryk Mar 13 '15 at 16:18
  • Oh, I see - the graphic inside FileFormat.info. I'd have to say this is incorrect in that `\uFFFF` is explicitly defined in Unicode, is a valid Unicode Non-character, and is represented in UTF-16 as ``. Their graphic is wrong. – Ehryk Mar 13 '15 at 16:24
  • @Ehryk But look at the info from Unicode.org (and click on the link). These two are specifically listed as ****. This is a distinction from other code-points that are not defined, such as most of them in the FFF0 - FFFF range. So, yes, technically these are "defined", but defined as being unusable ;-). – Solomon Rutzky Mar 13 '15 at 16:54
  • They are defined as non characters, but this does not make them unusable. They can be used to represent byte order marks, for example - just not used as characters. They can also be used legally in Windows filenames. – Ehryk Mar 13 '15 at 17:15
  • The BEL character would be another example - can't be used graphically, but can be used to beep a terminal and thus is a usable non-character. – Ehryk Mar 13 '15 at 17:18
  • @Ehryk I understand your point, but there is no possible way to get that code-point into XML. You might be able to Base64 encode the entire filename. That is something that can be done via an XML function even. – Solomon Rutzky Mar 13 '15 at 17:37
  • Right, I think XML is then unsuitable for direct storage of windows filenames and C# strings directly, as they can contain Unicode code-points that are not valid in XML. Is this well-known? – Ehryk Mar 13 '15 at 18:53
  • @Ehryk Yes, I would agree that XML is not optimal in this regards. Why are you using XML in the first place? Just curious. – Solomon Rutzky Mar 13 '15 at 18:58
  • To perform bulk inserts instead of an individual insert for each file (they number in the millions) – Ehryk Mar 13 '15 at 19:38
  • @Ehryk Ok. That is actually quite the easy problem to solve. But first, this particular question is not about uploading files in bulk. I have just updated my answer to include additional info from W3C regarding valid characters. I suggest we continue this in the other question as that is more focused on the bulk issue and not about a particular character. I will also add my suggestion about Base64 encoding here as that is still an option. – Solomon Rutzky Mar 13 '15 at 19:52
  • @Ehryk I have finished my updates here. Please mention the reason for the XML in a comment on the other question and we can proceed from there. But there is a better (and faster) way to bulk upload this data :). – Solomon Rutzky Mar 13 '15 at 20:40
  • Done, and thank you. I'm interested in the better and faster way to bulk upload this data (hopefully to a stored procedure, not through shared access to a file or some such thing that imposes additional configuration). – Ehryk Mar 13 '15 at 20:46
  • @Ehryk Very slight additional config, but so worth it. Yes, to a store proc, no shared file or global temp table, etc. – Solomon Rutzky Mar 13 '15 at 20:52