11

This is not a question on how to overcome the "XML parsing: ... illegal xml character" error, but about why it is happening? I know that there are fixes(1, 2, 3), but need to know where the problem arises from before choosing the best solution (what causes the error under the hood?).

We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoding using UTF-8, so in C# we create the file, and specify UTF-8 where appropriate:

var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);

This creates an XML file on disk that has contains the following (abbreviated) data:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>

Notice that in the second record, - is different to . I believe the second instance is en-dash.

If I open that XML file in Firefox/IE/VS2015. it opens without error. The W3C XML validator also works fine. But, SSMS 2012 does not like it:

declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 25, illegal xml character

So why does en-dash cause the error? From my research, it would appear that

...only a few entities that need escaping: <,>,\,' and & in both HTML and XML. Source

...of which en-dash is not one. An encoded version (replacing with &#8211;) works fine.

UPDATE

Based on the input, people state that en-dash isn't recognised as UTF-8, but yet it is listed here http://www.fileformat.info/info/unicode/char/2013/index.htm So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?

Community
  • 1
  • 1
EvilDr
  • 8,943
  • 14
  • 73
  • 133
  • Do you get the same SQL error when omitting the XML declaration and solely feeding `declare @xml XML = ' ';`? – Martin Honnen Apr 04 '16 at 12:25
  • It works when the declaration is omitted. BUT - isn't that bad practice potentially leading to other issues? Its not the fix so much that I'm after, but the cause. – EvilDr Apr 04 '16 at 12:29
  • Or try encoding="utf-16" .NET is probably not the problem. It is MSSQL. – smoore4 Apr 04 '16 at 12:29
  • 1
    This is the background on why.....http://stackoverflow.com/questions/3760788/how-to-solve-unable-to-switch-the-encoding-error-when-inserting-xml-into-sql-s – smoore4 Apr 04 '16 at 12:32

4 Answers4

10

Please permit me to answer my own question, for the purpose of me understanding it fully myself. I won't accept this as the answer; it is the combination of the other answers that lead me here. If this answer helps you in the future, please upvote the other posts also.

The basic underlying rule is that XML with Unicode characters should be passed to, and parsed as, Unicode by SQL Server. Therefore C# should generate XML as UTF-16; the SSMS and .Net default.

Cause of original problem

This variable declares XML with UTF-8 encoding, but the entity en-dash cannot be used without being encoded in UTF-8. This is wrong:

DECLARE @badxml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 3, character 29, illegal xml character

Another approach that doesn't work is to switch UTF-8 to UTF-16 in the XML. The string here is not unicode, so the implicit conversion fails:

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

XML parsing: line 1, character 56, unable to switch the encoding

Solutions

Alternatives that work are:

1) Leave as UTF-8 but encode with hexadecimal on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#x2013; Bar" />
</records>';

2) As above but with decimal encoding on the entity (reference):

DECLARE @xml xml = '<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
  <r RecordName="Option &#8211; Bar" />
</records>';

3) Include the original entity, but remove UTF-8 encoding in declaration (SSMS then applies UTF-16; its default):

DECLARE @xml xml = '<?xml version="1.0" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';

4) Retain the UTF-16 declaration, but cast the XML to Unicode (note the preceding N before casting as XML):

DECLARE @xml xml = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option – Bar" />
</records>';
Simon Kingston
  • 495
  • 2
  • 15
EvilDr
  • 8,943
  • 14
  • 73
  • 133
6

Can you modify the XML encoding declaration? If so;

declare @xml XML = N'<?xml version="1.0" encoding="utf-16" standalone="yes"?><records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';

select @xml

(No column name)
<records><r RecordName="Option - Foo" /><r RecordName="Option – Bar" /></records>

Speculative Edit

Both of these fail with illegal xml character:

set @xml = '<?xml version="1.0" encoding="utf-8"?><x> – </x>'
set @xml = '<?xml version="1.0" encoding="utf-16"?><x> – </x>'

because they pass a non-unicode varchar to the XML parser; the string contains Unicode so must be treated as such, i.e. as an nvarchar (utf-16) (otherwise the 3 bytes comprising the are misinterpreted as multiple characters and one or more is not in the acceptable range for XML)

This does pass a nvarchar string to the parser, but fails with unable to switch the encoding:

set @xml = N'<?xml version="1.0" encoding="utf-8"?><x> – </x>'

This is because an nvarchar (utf-16) string is passed to the XML parser but the XML document states its utf-8 and the is not equivalent in the two encodings

This works as everything is utf-16

set @xml = N'<?xml version="1.0" encoding="utf-16"?><x> – </x>'
Alex K.
  • 171,639
  • 30
  • 264
  • 288
5

SQL Sever internally uses UTF-16. Either let the encoding away or cast to unicode

The reason you are looking for: With UTF-8 specified, this character is not known.

--without your directive, SQL Server picks its default
declare @xml XML = 
'<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>';
select @xml;

--or UNICODE, but you must use UTF-16
declare @xml2 XML = 
CAST('<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<records>
  <r RecordName="Option - Foo" />
  <r RecordName="Option – Bar" />
</records>' AS NVARCHAR(MAX));

select @xml2

UPDATE

UTF-8 means, that there are chunks of 8 bits used to carry information. The base characters are just one chunk, easy going...

Other characters can be encoded as well. There are "c2" and "c3" codes (look here). c3-codes need three chunks to be encoded. But the internally used UTF16 expects 2 byte encoded characters.

Hope this is clear now...

UPDATE 2

This code will show you, that the Hyphen has the ASCII code 45 and your en-dash 150:

DECLARE @x VARCHAR(100)=
'<r RecordName="Option - Foo" /><r RecordName="Option – Bar" />';

WITH RunningNumbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Nmbr
    FROM sys.objects
)
SELECT SUBSTRING(@x,Nmbr,1), ASCII(SUBSTRING(@x,Nmbr,1)) AS ASCII_Code
FROM RunningNumbers
WHERE ASCII(SUBSTRING(@x,Nmbr,1)) IS NOT NULL;

Have a look here All characters with 7 bits are "plain" and should encode without problems. The "extended ASCII" is depending on code tables and could vary. 150 might be en-dash or something else. UTF8 uses some tricky encodings to allow strange characters to be "legal". Obviously (this was new to me too) the internally used UTF16 cannot cope with c3-characters.

Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • Yes I read this as a workaround, but the question is *why* the behavior occurs. Please see update in question for more. – EvilDr Apr 04 '16 at 12:43
  • With UTF8, it *should* be known as its listed as a UTF8 character here: http://www.fileformat.info/info/unicode/char/2013/index.htm. Its all very strange to me... :-/ – EvilDr Apr 04 '16 at 12:48
  • @EvilDr Where do you see that it is a legal UTF8 character? As far as I see your link lists *escape sequences* for different encodings. Doesn't this `0xE2 0x80 0x93 (e28093)` show clearly, that it is **not** a legal UTF8 character? – Shnugo Apr 04 '16 at 12:58
  • I just read the title, *"Unicode Character 'EN DASH'"*. I'm probably wrong; I'll be dreaming about the damn character for the next month – EvilDr Apr 04 '16 at 13:02
  • I don't get how the source (UTF8) system then can provide me with that illegal character, or not throw an error of some sort when that entity is created? – EvilDr Apr 04 '16 at 13:02
  • 1
    utf-8 is not a *subset* of unicode, the dash is 3 bytes in utf-8 which is perfectly fine as utf-8 is a variable length encoding. Its 2 bytes in utf-16 however and that fact is probably why there is an issue with the encoding of the string literal and the xml parser. – Alex K. Apr 04 '16 at 13:04
  • Lovely answer thanks. I feel sadly I must accept Alex's as it was posted slightly before. Thank you for all your help - a real valuable learning experience. – EvilDr Apr 04 '16 at 14:40
  • 1
    @EvilDr No problem, happy coding! – Shnugo Apr 04 '16 at 14:41
2

The MSDN guidelines says:

SQLXML 4.0 relies upon the limited support for DTDs provided in SQL Server. SQL Server allows for an internal DTD in xml data type data, which can be used to supply default values and to replace entity references with their expanded contents. SQLXML passes the XML data "as is" (including the internal DTD) to the server. You can convert DTDs to XML Schema (XSD) documents using third-party tools, and load the data with inline XSD schemas into the database.

Rahul Tripathi
  • 168,305
  • 31
  • 280
  • 331