As you found out correctly, XML will be stored as unicode (utf-16, well, it's ucs-2 actually)
. There is no other format.
Within SQL-Server there is VARCHAR(MAX)
for extended ASCII (1-byte)
and NVARCHAR(MAX)
for unicode
. Both can be casted to XML directly (as long as the string is valid XML). One must be aware, that VARCHAR(MAX)
might not be able to deal with special characters... So - if this is an issue - you should stick with unicode
anyway.
The problem occurs, when the encoding declaration is included within <?xml ...?>
:
This works:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
<root>test</root>';
SELECT @xml;
This produces an error:
DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
But this works again (see the leading N
before the string literal):
DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
<root>test</root>';
SELECT @xml;
##Fazit
If you pass the string 1-byte encoded, but declared as utf-16 (or vice-versa) you'll get into troubles. Best is, to pass your XML without the <?xml ...?>
-declaration.
##UPDATE
You are mixing two things
##Encoding
From your comment:
UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...
Yes, it's correct, that UTF-8
and UTF-16
are two flavours of unicode
. But it is not correct to call utf-8
the new de-facto standard. This depends heavily on your needs. Living in an english speaking country, dealing with plain latin text will save some bytes using UTF-8
. Living somewhere far east will bloat your text incredibly, due to many 3 and 4 byte codes.
And - this is more important in terms of databases - the fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'
. With utf-16
the engine can cut byte 200 and 201 without looking, with utf-8
the full string up to character 100 must be analysed to find out, where the 100th characters sits actually. I would prefer utf-8
only in cases, where band-width or storage space is an important factor... SQL Server uses a fixed width 1-byte encoding and no utf-8
actually: extended ASCII in combination with a collation.
I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column
And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR
:
DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';
This works with any combination. You can declare NVARCHAR
and/or put an N
in front of the literal. No problem due to implicit conversions.
But internal VARCHAR
cannot deal with higher encodings!. Try this:
DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
SELECT @s
This will work with NVARCHAR
and N'Your string'
only!
##XML-storage
As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance. Therefore fixed width UTF-16
. The xml-declaration is ommitted in any case...
The problem occurs, when you pass in a string which is physically encoded as utf-8
but declared as something else (or vice versa). You can pass in a real UTF-16
with a declared encoding of utf-16
(same with utf-8
) without problems.
##Fazit
If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.