This is not a question about how to overcome the "XML parsing: ... illegal xml character" error, but about why it happens. I know that there are fixes (1, 2, 3), but I need to know where the problem arises before choosing the best solution (what causes the error under the hood?).
We are calling a Java-based webservice using C#. From the strongly-typed data returned, we are creating an XML file that will be passed to SQL Server. The webservice data is encoded using UTF-8, so in C# we create the file and specify UTF-8 where appropriate:
var encodingType = Encoding.UTF8;
// logic removed...
var xdoc = new XDocument();
xdoc.Declaration = new XDeclaration("1.0", encodingType.WebName, "yes");
// logic removed...
System.IO.File.WriteAllText(xmlFullPath, xdoc.Declaration.ToString() + xdoc.Document.ToString(), encodingType);
This creates an XML file on disk that contains the following (abbreviated) data:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>
Notice that in the second record, - is different to –. I believe the second instance is an en-dash.
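To check which character is actually in the data, the code point and UTF-8 bytes can be printed. This is only a minimal sketch: `DashInspector` and `Describe` are hypothetical names introduced for illustration and are not part of our real code.

```csharp
using System;
using System.Text;

public static class DashInspector
{
    // Hypothetical helper: describes the first character of s as
    // "U+<code point> -> <UTF-8 bytes in hex>".
    public static string Describe(string s)
    {
        int codePoint = char.ConvertToUtf32(s, 0);
        byte[] utf8 = Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codePoint));
        return $"U+{codePoint:X4} -> {BitConverter.ToString(utf8)}";
    }

    public static void Main()
    {
        Console.WriteLine(Describe("-"));      // hyphen-minus: U+002D -> 2D
        Console.WriteLine(Describe("\u2013")); // en-dash:      U+2013 -> E2-80-93
    }
}
```

So the en-dash is U+2013, which UTF-8 encodes as the three bytes E2 80 93, whereas the plain hyphen is a single ASCII byte.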
If I open that XML file in Firefox/IE/VS2015, it opens without error. The W3C XML validator also accepts it. But SSMS 2012 does not like it:
declare @xml XML = '<?xml version="1.0" encoding="utf-8" standalone="yes"?><records>
<r RecordName="Option - Foo" />
<r RecordName="Option – Bar" />
</records>';
XML parsing: line 3, character 25, illegal xml character
So why does en-dash cause the error? From my research, it would appear that
...only a few entities that need escaping: <,>,\,' and & in both HTML and XML. Source
...of which en-dash is not one. An encoded version (replacing – with &#8211;) works fine.
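For reference, the encoded form that works can be produced mechanically by replacing every non-ASCII character with a decimal numeric character reference (`&#8211;` for the en-dash). The following is a rough sketch of that workaround, not an explanation of the error: `XmlAsciiEscaper` and `EscapeNonAscii` are hypothetical names, and the approach assumes the non-ASCII characters only occur in attribute values or element text (character references are not valid in element names).

```csharp
using System;
using System.Text;

public static class XmlAsciiEscaper
{
    // Hypothetical helper: replaces each character above U+007F with a
    // decimal numeric character reference, leaving ASCII untouched.
    public static string EscapeNonAscii(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        for (int i = 0; i < xml.Length; i++)
        {
            // Throws on a lone surrogate, i.e. on malformed UTF-16 input.
            int codePoint = char.ConvertToUtf32(xml, i);
            if (char.IsSurrogatePair(xml, i))
            {
                i++; // the pair was consumed as a single code point
            }

            if (codePoint > 0x7F)
            {
                sb.Append("&#").Append(codePoint).Append(';');
            }
            else
            {
                sb.Append((char)codePoint);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // prints: <r RecordName="Option &#8211; Bar" />
        Console.WriteLine(EscapeNonAscii("<r RecordName=\"Option \u2013 Bar\" />"));
    }
}
```

The output is pure ASCII, so it reads the same whichever single-byte code page or Unicode encoding the string passes through.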
UPDATE
Based on the input so far, people state that en-dash isn't recognised as UTF-8, yet it is listed here: http://www.fileformat.info/info/unicode/char/2013/index.htm So, as a perfectly legal character, why won't SSMS read it when passed as XML (using UTF-8 OR UTF-16)?