
A piece of T-SQL doesn't behave the same in the Production and Test environments. When the code below is executed on Prod, it returns data:

SELECT [col1xml]
FROM [DBName].[dbo].[Table1] WITH (NOLOCK)
WHERE CAST([col1xml] AS XML).value('(/Payment/****/trn1)[1]', 'nvarchar(20)') = '123456'

However, the same code returns the error below when run in Test:

Msg 9402, Level 16, State 1, Line 9 XML parsing: line 1, character 38, unable to switch the encoding

I have seen the fix suggested on this site of converting the encoding, and it works in both Prod and Test; see below. However, I need to explain to the developers why this behavior is occurring, and give a rationale for why they should change their code (if that is the case).

WHERE CAST(
    REPLACE(CAST(col1xml AS VARCHAR(MAX)), 'encoding="utf-16"', 'encoding="utf-8"')
    AS XML).value('(/Payment/****/trn1)[1]', 'NVARCHAR(max)') = '123456'

I have compared both DBs and looked for anything obvious such as ANSI NULLS and ANSI PADDING settings. Everything is the same, including the SQL Server version: SQL Server 2012, build 11.0.5388. The data differs between environments, but the table schema is identical and the data type of col1xml is ntext.
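To compare the data itself, this ad-hoc sketch (it assumes every row of col1xml casts cleanly to NVARCHAR) lists which XML declarations are actually present in each environment, which should show whether Test contains declarations that Prod does not:

```
-- Sketch: group rows by the first 60 characters, which will include
-- any <?xml ... encoding="..."?> declaration if one is present.
SELECT LEFT(CAST([col1xml] AS NVARCHAR(MAX)), 60) AS declaration_prefix,
       COUNT(*)                                   AS row_count
FROM [DBName].[dbo].[Table1] WITH (NOLOCK)
GROUP BY LEFT(CAST([col1xml] AS NVARCHAR(MAX)), 60);
```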

ADTJOB
  • Because the encoding of the XML is (implicitly or explicitly) fully determined by the underlying string type, your XML documents should *not* contain encoding directives -- these do nothing but take up space and potentially trip up the parser. You're best off stripping these entirely on storing them, if you can't avoid getting them (that is, don't replace them with `encoding="utf-8"`, replace them with *nothing*). Most XML libraries can be convinced to not output an XML declaration, or at least not one with an encoding. – Jeroen Mostert Jul 03 '17 at 22:40
  • [Bad habits: putting `nolock` everywhere](https://www.sentryone.com/blog/aaronbertrand/bad-habits-nolock-everywhere) – Dale K May 20 '23 at 04:11

1 Answer

In SQL Server you should store XML in a column of the native XML type. This type has a lot of advantages: it is much faster and performs implicit validity checks.

From your question I take it that you store your XML in NTEXT. This type has been deprecated for ages and will not be supported in future versions! You ought to change this soon!

SQL Server knows two kinds of strings:

  • 1-byte strings (CHAR or VARCHAR), which hold extended ASCII.
    Important: this is not UTF-8! Native UTF-8 support will be part of a coming version.
  • 2-byte strings (NCHAR or NVARCHAR), which are UTF-16 (UCS-2).

If the XML has a leading declaration with an encoding (in most cases utf-8 or utf-16), you can get into trouble.

If the XML is stored as a 2-byte string (and NTEXT tells me it is), the declaration has to be utf-16. With a 1-byte string it should be utf-8.

The best (and easiest) way is to omit the declaration completely. You do not need it. Storing the XML in the appropriate native type will remove this declaration automatically.

What you should do: create a new column of type XML and move all your XML over to this column. Get rid of any TEXT, NTEXT and IMAGE columns you might have!
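A sketch of that migration, using the table and column names from the question (verify on a copy first; the REPLACEs strip the encoding declaration so that both utf-8 and utf-16 variants survive the conversion):

```
-- Add a native XML column alongside the old NTEXT column.
ALTER TABLE [dbo].[Table1] ADD [col1xml_new] XML NULL;
GO
-- Cast NTEXT -> NVARCHAR(MAX), strip any encoding directive,
-- then cast to XML. A declaration without an encoding is harmless.
UPDATE [dbo].[Table1]
SET [col1xml_new] = CAST(
        REPLACE(REPLACE(CAST([col1xml] AS NVARCHAR(MAX)),
                        ' encoding="utf-8"',  ''),
                        ' encoding="utf-16"', '')
        AS XML);
GO
-- After verifying the migrated data:
-- ALTER TABLE [dbo].[Table1] DROP COLUMN [col1xml];
-- EXEC sp_rename 'dbo.Table1.col1xml_new', 'col1xml', 'COLUMN';
```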

The next step: be happy and enjoy the fast and easy handling of the native XML type.

UPDATE: Differences between environments

Data between environments is different

The error happens here:

cast([col1xml] as xml)

If your column stored the XML in the native type, you would not need a cast (which is very expensive!) at all. But in your case the success of this cast depends on the actual XML. As it is stored in NTEXT, it is a 2-byte string. If your XML starts with a declaration stating a non-supported encoding (in most cases utf-8), the cast will fail.

Try this, which works:

DECLARE @xml2Byte_UTF16 NVARCHAR(100)='<?xml version="1.0" encoding="utf-16"?><root>test1</root>';
SELECT CAST(@xml2Byte_UTF16 AS XML);

DECLARE @xml1Byte_UTF8 VARCHAR(100)='<?xml version="1.0" encoding="utf-8"?><root>test2</root>';
SELECT CAST(@xml1Byte_UTF8 AS XML);

These fail:

DECLARE @xml2Byte_UTF8 NVARCHAR(100)='<?xml version="1.0" encoding="utf-8"?><root>test3</root>';
SELECT CAST(@xml2Byte_UTF8 AS XML);

DECLARE @xml1Byte_UTF16 VARCHAR(100)='<?xml version="1.0" encoding="utf-16"?><root>test4</root>';
SELECT CAST(@xml1Byte_UTF16 AS XML);

Play around with VARCHAR and NVARCHAR and utf-8 and utf-16...
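And for completeness, in the same style as the examples above: omitting the declaration entirely (or declaring no encoding at all) works for both string types, which is exactly why stripping it is the safe option.

```
-- No declaration at all: fine for both string types.
DECLARE @noDecl2Byte NVARCHAR(100) = N'<root>test5</root>';
SELECT CAST(@noDecl2Byte AS XML);

DECLARE @noDecl1Byte VARCHAR(100) = '<root>test6</root>';
SELECT CAST(@noDecl1Byte AS XML);

-- Declaration without an encoding attribute: also fine for both.
DECLARE @noEnc NVARCHAR(100) = N'<?xml version="1.0"?><root>test7</root>';
SELECT CAST(@noEnc AS XML);
```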

Shnugo
  • `CHAR` and `VARCHAR` contain "non-Unicode data". The collations used are not required to be ASCII-compatible and calling them "extended ASCII" is misleading. Point in case: `SQL_EBCDIC037_CP1_CS_AS`. (I do think all supported collations are single-byte, at least.) Also, do you have a source for the claim that UTF-8 will be supported for `CHAR` storage in a future version? All I know of is support for code page 65001 in bulk import/export operations from SQL Server 2016 onwards, which is not the same thing. – Jeroen Mostert Jul 03 '17 at 22:34
  • Thanks. This is all useful information. But I still can't spot the behavior difference between environments. I checked the collation at DB, table and column level. I'll definitely feed back the XML data type and NTEXT being deprecated. – ADTJOB Jul 04 '17 at 09:36
  • @ADTJOB [Find further information here](https://stackoverflow.com/a/41505440/5089204) – Shnugo Jul 04 '17 at 12:03
  • @JeroenMostert Uhm... This [Connect article](https://connect.microsoft.com/SQLServer/feedback/details/362867/add-support-for-storing-utf-8-natively-in-sql-server) does not support my statement. Due to the nature of a database it is quite clear, why a fixed byte width wins... I ran over a *What's new in SQL Server 2016?* list and saw the headline *support for UTF-8*, but you are correct: [this is not touching normal string handling, just BCP, BULK etc.](https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/what-s-new-in-sql-server-2016-database-engine#UTF8) – Shnugo Jul 04 '17 at 12:24
  • @ADTJOB See my update to explain the differences in environment – Shnugo Jul 04 '17 at 14:15