9

The default encoding for an XML type field defined in an SQL Server is UTF-16. I have no trouble inserting into that field with UTF-16 encoded XML streams.

But if I tried to insert into the field with UTF-8 encoded XML stream, the insert attempt would receive the error response
unable to switch encoding.

QUESTION: Is there a way to define a SQL Server column/field as having UTF-8 encoding?

Further info

The insertion operations are performed using Spring JDBCTemplate.

The XML Stream was produced by JAXB Marshaller set to UTF-8 or UTF-16 encoding.

private String marshall(myDAO myTao, JAXBEncoding jaxbEncoding)
throws JAXBException{
    JAXBContext jc = JAXBContext.newInstance(ObjectFactory.class);
    m = jc.createMarshaller();
    m.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
    if (jaxbEncoding!=null)
        m.setProperty(Marshaller.JAXB_ENCODING, jaxbEncoding.toString());
    StringWriter strw = new StringWriter();
    m.marshal(myTao, strw);
    String strw.toString();
}

Where ...

public enum JAXBEncoding {
    UTF8("UTF-8"),
    UTF16("UTF-16")
    ;
    
    private String value;
    private JAXBEncoding(String value){
        this.value = value;
    }
    
    public String toString(){
        return this.value;
    }
}
Community
  • 1
  • 1
Blessed Geek
  • 21,058
  • 23
  • 106
  • 176

4 Answers4

9

Is there a way to define a SQL Server column/field as having UTF-8 encoding?

No, the only Unicode encoding in SQL Server is UTF-16 Little Endian, which is how the NCHAR, NVARCHAR, NTEXT (deprecated as of SQL Server 2005 so don't use this in new development; besides, it sucks compared to NVARCHAR(MAX) anyway), and XML datatypes are handled. You do not get a choice of Unicode encodings like some other RDBMS's allow.

You can insert UTF-8 encoded XML into SQL Server, provided you follow these three rules:

  1. The incoming string has to be of datatype VARCHAR, not NVARCHAR (as NVARCHAR is always UTF-16 Little Endian, hence the error about not being able to switch the encoding).
  2. The XML has an XML declaration that explicitly states that the encoding of the XML is indeed UTF-8: <?xml version="1.0" encoding="UTF-8" ?>.
  3. The byte sequence needs to be the actual UTF-8 bytes.

For example, we can import a UTF-8 encoded XML document containing the screaming face emoji (and we can get the UTF-8 byte sequence for that Supplementary Character by following that link):

SET NOCOUNT ON;
DECLARE @XML XML = '<?xml version="1.0" encoding="utf-8"?><root><test>'
                    + CHAR(0xF0) + CHAR(0x9F) + CHAR(0x98) + CHAR(0xB1)
                    + '</test></root>';

SELECT @XML;
PRINT CONVERT(NVARCHAR(MAX), @XML);

Returns (in both "Results" and "Messages" tabs):

<root><test></test></root>

You mentioned in a comment on @Shnugo's answer:

I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem?

No, you didn't store UTF-8 encoded anything in an NVARCHAR column (besides, there is no 2013 version of SQL Server, but that is probably just a typo). NVARCHAR is only ever UTF-16 Little Endian. Most likely your UTF-8 stream got converted into UTF-16 LE by the database driver during transit into SQL Server. This is the same encoding that an XML column would use, but the XML column would have tried to convert the stream from UTF-8 into UTF-16 but failed due to it already being UTF-16. This also means that on the way out of SQL Server, the XML document stored in the NVARCHAR column would still have the XML declaration stating that the encoding is UTF-8, but it's definitely not UTF-8.

If you absolutely need the data to be UTF-8 on the way out because you don't want to convert the UTF-16 LE coming out of SQL Server XML or NVARCHAR into UTF-8, then you have no choice but to store the data as VARBINARY(MAX).

Solomon Rutzky
  • 46,688
  • 9
  • 128
  • 171
4

As you found out correctly, XML will be stored as unicode (utf-16, well, it's ucs-2 actually). There is no other format.

Within SQL-Server there is VARCHAR(MAX) for extended ASCII (1-byte) and NVARCHAR(MAX) for unicode. Both can be casted to XML directly (as long as the string is valid XML). One must be aware, that VARCHAR(MAX) might not be able to deal with special characters... So - if this is an issue - you should stick with unicode anyway.

The problem occurs, when the encoding declaration is included within <?xml ...?>:

This works:

DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-8"?>
 <root>test</root>';

SELECT @xml;

This produces an error:

DECLARE @xml XML =
'<?xml version="1.0" encoding="utf-16"?>
 <root>test</root>';

SELECT @xml;

But this works again (see the leading N before the string literal):

DECLARE @xml XML =
N'<?xml version="1.0" encoding="utf-16"?>
 <root>test</root>';

SELECT @xml;

##Fazit

If you pass the string 1-byte encoded, but declared as utf-16 (or vice-versa) you'll get into troubles. Best is, to pass your XML without the <?xml ...?>-declaration.

##UPDATE

You are mixing two things

##Encoding

From your comment:

UTF-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now...

Yes, it's correct, that UTF-8 and UTF-16 are two flavours of unicode. But it is not correct to call utf-8 the new de-facto standard. This depends heavily on your needs. Living in an english speaking country, dealing with plain latin text will save some bytes using UTF-8. Living somewhere far east will bloat your text incredibly, due to many 3 and 4 byte codes.

And - this is more important in terms of databases - the fixed width is enormously easier to handle. Just imagine a WHERE SUBSTRING(SomeUTF8Column,100,1)='A'. With utf-16 the engine can cut byte 200 and 201 without looking, with utf-8 the full string up to character 100 must be analysed to find out, where the 100th characters sits actually. I would prefer utf-8 only in cases, where band-width or storage space is an important factor... SQL Server uses a fixed width 1-byte encoding and no utf-8 actually: extended ASCII in combination with a collation.

I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column

And - this is even more important in terms of XML - XML is not stored as the text you see, rather as a hierarchy tree. You can store literally everything in (N)VARCHAR:

DECLARE @s VARCHAR(MAX)='Don''t store me, I''m UTF-16. Your machine will explode!';

This works with any combination. You can declare NVARCHAR and/or put an N in front of the literal. No problem due to implicit conversions.

But internal VARCHAR cannot deal with higher encodings!. Try this:

 DECLARE @s NVARCHAR(MAX)=N'слов в тексте';
 SELECT @s

This will work with NVARCHAR and N'Your string' only!

##XML-storage

As said before, XML is not stored as the text you see, but as a tree. Everything is optimized for performance. Therefore fixed width UTF-16. The xml-declaration is ommitted in any case...

The problem occurs, when you pass in a string which is physically encoded as utf-8 but declared as something else (or vice versa). You can pass in a real UTF-16 with a declared encoding of utf-16 (same with utf-8) without problems.

##Fazit

If you have the slightest chance to include 3 or 4 byte UTF-8 codes you should stick to UTF-16.

Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • UTG-8 is flexi-length unicode, that varies from 1 byte to 4 bytes in length. Whereas, UTF-16 is fixed length 2 byte unicode. UTF-8 seems the defacto unicode std now. The XML stream is produced by JAXB marshaller. I've had no problems inserting utf-8 encoded streams with utf-8 header into SQL Server 2013 NVARCHAR column. Would there be a hidden problem? – Blessed Geek Jan 06 '17 at 15:48
  • 3
    No downvote, but your answer is somewhat misleading in that **(1)** `varchar` is not "for utf-8", it is for the (single-byte character) code page defined for the SQL Server instance. Some character sets (e.g., "latin1") will look a lot like UTF-8 but they are *not* the same, and feeding UTF-8 to a `varchar` column will likely produce some mojibake (e.g., 'Montrél' instead of 'Montréal'). **(2)** UTF-8 and UTF-16 are not "two flavours of unicode", they are two *encodings* of Unicode: the characters (code points) are the same, it's their representation as a stream-of-bytes that is different. – Gord Thompson Jan 07 '17 at 18:09
  • 1
    (Further to the second point, many people find [this article](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) quite helpful.) – Gord Thompson Jan 07 '17 at 18:11
  • My xml streams have chinese, japanese, hebrew characters. When I used varchar, they were all gracefully and graciously "transformed" into question marks. – Blessed Geek Jan 08 '17 at 05:35
  • @BlessedGeek That is exactly what I meant with this: *This will work with NVARCHAR and N'Your string' only!*. The `VARCHAR`-type **is not able to handle special characters**. As you are dealing with *chinese, japanese, hebrew* language, **you ought to use `NVARCHAR`** – Shnugo Jan 08 '17 at 12:00
  • 2
    @GordThompson, Thank you very much for the link! I read this article with a mixture of *Ahh* and *shame* :-) Really enlightening and very well written. – Shnugo Jan 08 '17 at 15:56
  • 1
    @BlessedGeek, Taking all your comments together (SQL-Server serves as a buffer only) you might consider stuffing your UTF-8 encoded XML-string into a column of type `VARBINARY(MAX)`. You will not be able to do anything with this on SQL-Level (well, some fancy, *.Net-DLL-Assembly-methods* might be used), but you can push it in and call it out. That's very much enough for a buffer... – Shnugo Jan 08 '17 at 15:58
  • Excellent answer. One of the best and the most practical advice looks like: "Best is, to pass your XML without the ``-declaration". Normally you don't want to store representation (encoding) in database. The answer helped a lot. – Interface Unknown Jul 15 '17 at 16:02
  • UTF-16 is *not* fixed-length; it is merely less-variable-length than UTF-8. It is still possible to have a single code point take up two UTF-16 sequences (ie. anything outside the BMP), and also for a single visible character to be encoded as multiple code points (eg. decomposed accents). The main reason why SQL Server standardised on UTF-16 is because Windows itself has, and the main reason for that is that at the time Microsoft adopted Unicode, UTF-8 did not yet exist and neither did any characters outside the BMP. It is not a performance consideration. – Miral Jan 18 '18 at 23:44
  • @Miral true, nvarchar uses Unicode UCS-2, which ist - for almost any used charakter - the same as utf-16. – Shnugo Jan 18 '18 at 23:51
2

A 2-step works; first encode your UTF-8 to text or varchar(MAX) and then to xml.

convert(xml, convert(text, '<your UTF-8 xml>'))
Klompenrunner
  • 723
  • 7
  • 13
1

The "Type Casting String and Binary Instances" section of the MSDN document

Create Instances of XML Data

explains how incoming XML data is interpreted. Essentially,

  • if the SQL Server receives the XML data as nvarchar then it "assumes a two-byte unicode encoding such as UTF-16 or UCS-2",

  • if the SQL Server receives the XML data as varchar then by default it will use the (single-byte character set) code page defined for the SQL Server instance,

  • if the SQL Server receives the XML data as varbinary then it "is treated as a codepoint stream that is passed directly to the XML parser", and "an instance without BOM and without a declaration encoding will be interpreted as UTF-8".

If your marshalling code is spitting out a Java String to be sent to the SQL Server then it is very likely being sent as nvarchar since a Java String is always a Unicode string. That would explain why the SQL Server assumes UTF-16 encoding.

If you really need to send the XML data to the SQL Server with UTF-8 encoding (though I can't imagine why) then your marshalling code probably needs to produce a stream of (UTF-8 encoded) bytes that will be sent to the SQL Server as varbinary.

Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • *"If you really need to send the XML data with UTF-8 encoding (though I can't imagine why)"* - that is because I need to send the downstream too-many-of-them consumers, the preamble string "**" because I have no assurance that they can deal with *encoding=utf-16*. – Blessed Geek Jan 08 '17 at 05:27
  • I am using sql server as the frontend buffer for these downstream consumers. If I had to include more character/byte processing into transmission of the xml stream, that would defeat the purpose of having a frontend buffer. – Blessed Geek Jan 08 '17 at 05:29
  • @BlessedGeek - If you are just using SQL Server storage as a cache for the XML data then perhaps you don't really need to store the XML in a true `xml` column; you could simply store it "ready to serve" (i.e., properly encoded *and escaped*) in an `[n]varchar(max)` column. See the "XML Storage Options" section of [this MSDN document](https://msdn.microsoft.com/en-us/library/hh403385.aspx) for a discussion on whether a true `xml` column will offer any real benefit for your particular usage case. – Gord Thompson Jan 08 '17 at 13:29