1

I'm trying to create a stored procedure in SQL Server 2016 that converts XML that was previously converted into Varbinary back into XML, but getting an "Illegal XML character" error when converting. I've found a workaround that seems to work, but I can't actually figure out why it works, which makes me uncomfortable.

The stored procedure takes data that was converted to binary in SSIS and inserted into a varbinary(MAX) column in a table and performs a simple

CAST(Column AS XML)

It worked fine for a long time, and I only began seeing an issue when the initial XML started containing an ® (registered trademark) symbol.

Now, when I attempt to convert the binary to XML I get this error

Msg 9420, Level 16, State 1, Line 23
XML parsing: line 1, character 7, illegal xml character

However, if I first convert the binary to varchar(MAX), then convert that to XML, it seems to work fine. I don't understand what is happening when I perform that intermediate CAST that is different than casting directly to XML. My main concern is that I don't want to add it in to account for this scenario and end up with unintended consequences.

Test code:

DECLARE @foo VARBINARY(MAX)
DECLARE @bar VARCHAR(MAX)
DECLARE @Nbar NVARCHAR(MAX) 

--SELECT Varbinary
SET @foo = CAST( '<Test>®</Test>' AS VARBINARY(MAX)) 
SELECT @foo AsBinary


--select as binary as varchar
SET @bar = CAST(@foo AS VARCHAR(MAX))

SELECT @bar BinaryAsVarchar                             -- Correct string output

--select binary as nvarchar
SET @nbar = CAST(@foo AS NVARCHAR(MAX))
SELECT @nbar BinaryAsNvarchar                           -- Chinese characters 

--select binary as XML
SELECT TRY_CAST(@foo AS XML) BinaryAsXML                -- ILLEGAL XML character
-- SELECT CONVERT(xml, @obfoo) BinaryAsXML                    --ILLEGAL XML Character

--select BinaryAsVarcharAsXML
SELECT TRY_CAST(@bar AS XML) BinaryAsVarcharAsXML       -- Correct Output

--select BinaryAsNVarcharAsXML
SELECT TRY_CAST(@nbar AS XML) BinaryAsNvarcharAsXML     -- Chinese Characters
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Evan Prickett
  • 11
  • 1
  • 2
  • `varchar` means ASCII, or at least, single-byte-encoded text. ® is outside the 0-127 range that isn't affected by codepages. Try with nvarchar and `CAST( N'®' as varbinary(max))`. `nvarchar` means UTF16 ie two bytes, which is why the conversion from varchar to varbinary to nvarchar failed. – Panagiotis Kanavos Nov 02 '18 at 17:15
  • 2
    Why are you doing *any* of those conversions anyway? Whatever problem you want to solve, mixing up types isn't going to help. If you have encoding errors, ensure you always use `nvarchar` fields, parameters *and* string literals. – Panagiotis Kanavos Nov 02 '18 at 17:16
  • BTW this means that the conversions that worked are actually wrong - they depend on using the same encoding both when converting to varbinary and back to text – Panagiotis Kanavos Nov 02 '18 at 17:18
  • 3
    Finally, why did the SSIS package store *text* into a *varbinary* column? That just *begs* for conversion issues. I suspect the initial author used `varchar`, run into conversion issues and instead of using the correct column collations or switching to `nvarchar`, just covered up the problem by using `varbinary`. That didn't fix anything, just moved the encoding problems to the reader. That worked as long as there were no non-Latin characters (which wouldn't have caused problems with varchar either). When the first non-Latin character was added, *boom*, the reader failed – Panagiotis Kanavos Nov 02 '18 at 17:19
  • Inlined transformations from question: `CAST(CAST( '®' AS VARBINARY(MAX)) AS NVARCHAR(MAX))` Uh, don't do that. The fundamental rule of character encodings is to read with the encoding that was used for writing. – Tom Blodget Nov 03 '18 at 19:04
  • @TomBlodget Although your statement is perfectly correct, I assume that the OP has to deal with rubbish produced by someone else. SQL-Server is not very mighty in dealing with encodings. Might be, that *read with the encoding that was used for writing* is impossible in this place... – Shnugo Nov 04 '18 at 10:14
  • @Shnugo Once it figured out how the data was mangled perhaps it can be reversed, using SQL CLR if necessary. – Tom Blodget Nov 04 '18 at 11:50
  • @TomBlodget That's what I mean... I many if my projects it would not be allowed to use CLR functions... Sometimes we have to go deal with rubbish. I think a multi-cast from varbin to varchar, then to nvarchar and finally to xml might solve this, as shown in my answer... – Shnugo Nov 04 '18 at 12:26
  • I think the original author used varbinary as a way to circumnavigate the string character limit in SSIS. After reading through everyone's comments, I tried a quick rework of the package + SQL table to use unicode string + nvarchar and ran into this error `Description: "Unicode data is odd byte size for column 11. Should be even byte size.".` From what I've read, you can solve this issue by explicitly defining an even string length, but then SSIS limits you to 4000 characters. – Evan Prickett Nov 13 '18 at 16:39
  • I think my solution is going to be to load the original XML directly into a SQL XML column using OPENROWSET and SQL bulk import in a script task – Evan Prickett Nov 13 '18 at 17:31

1 Answers1

0

There are several things to know:

  • SQL-Server is rather limited with character encodings. There is VARCHAR, which is 1-byte-encoded extended ASCII and NVARCHAR, which is UCS-2 (almost the same as utf-16).
  • VARCHAR uses plain latin for the first set of characters and a codepage-mapping provided by the collation in use for the second set.
  • VARCHAR is not utf-8. utf-8 works with VARCHAR, as long as all characters are 1-byte-enocded. But utf-8 knows a lot of 2-byte-enocded (up to 4-byte-enocded) characters, which would break the internal storage of a VARCHAR string.
  • NVARCHAR will work with almost any 2-byte encoded character natively (that means with almost any existing character). But it is not exactly utf-16 (there are 3-byte encoded characters, which would break SQL-Servers internal storage).
  • XML is not stored as the XML-string you see, but as an hierarchically organised physical table, based on NVARCHAR values.
  • The natively stored XML is really fast, while any text-based storage will need a very expensive parse-operation in advance (over and over...).
  • Storing XML as string is bad, storing XML as VARCHAR string is even worse.
  • Storing a VARCHAR-string-XML as VARBINARY is a cummulation of things you should not do.

Try this:

DECLARE @text1Byte VARCHAR(100)='<test>blah</test>';
DECLARE @text2Byte NVARCHAR(100)=N'<test>blah</test>';

SELECT CAST(@text1Byte AS VARBINARY(MAX)) AS text1Byte_Binary
      ,CAST(@text2Byte AS VARBINARY(MAX)) AS text2Byte_Binary
      ,CAST(@text1Byte AS XML) AS text1Byte_XML
      ,CAST(@text2Byte AS XML) AS text2Byte_XML
      ,CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS XML) AS text1Byte_XML_via_Binary
      ,CAST(CAST(@text2Byte AS VARBINARY(MAX)) AS XML) AS text2Byte_XML_via_Binary

The only difference you'll see are the many zeros in 0x3C0074006500730074003E0062006C00610068003C002F0074006500730074003E00. This is due to the 2-byte-encoding of nvarchar, each second byte is not needed in this sample. But if you'd need far-east-characters the picture would be completely different.

The reason why it works: SQL-Server is very smart. The cast from the variable to XML is rather easy, as the engine knows, that the underlying variable is varchar or nvarchar. But the last two casts are different. The engine has to examine the binary, whether it is a valid nvarchar and will give it a second try with varchar if it fails.

Now try to add your registered trademark to the given example. Add it first to the second variable DECLARE @text2Byte NVARCHAR(100)=N'<test>blah®</test>'; and try to run this. Then add it to the first variable and try it again.

What you can try:

Cast your binary to varchar(max), then to nvarchar(max) and finally to xml.

,CAST(CAST(CAST(CAST(@text1Byte AS VARBINARY(MAX)) AS VARCHAR(MAX)) AS NVARCHAR(MAX)) AS XML) AS text1Byte_XML_via_Binary

This will work, but it won't be fast...

Shnugo
  • 66,100
  • 9
  • 53
  • 114