3

Problem background - I am receiving response data from a website, formatted as JSON and UTF-8 encoded. The JSON body attribute holds values of a base64Binary type, which I store as nvarchar on MS SQL Server.

When converted to varchar or nvarchar, that base64Binary data contains garbled characters (in place of double quotes, etc.), indicating an encoding issue. This is the reason I am posting this question.

See the dissected code below and a runnable example at the bottom, with comments.

Notice the garbled characters after conversion, e.g.:

On behalf of IRB Holding Corp (the “Company")
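The effect is easy to reproduce outside SQL Server (a Python sketch; cp1252 stands in here for the server's single-byte Latin1 collation, which is an assumption about the setup):

```python
import base64

# The sample body value from the question (UTF-8 text, base64-encoded).
b64 = ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kg'
       'SG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==')
raw = base64.b64decode(b64)

# Decoded as UTF-8, the text is intact ...
print(raw.decode('utf-8'))

# ... but decoded as a one-byte codepage (what converting the bytes to
# varchar effectively does), the multi-byte sequences turn into mojibake.
print(raw.decode('cp1252', errors='replace'))
```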

The following query fixes the above issue (I see the quotes as they should appear), but it then fails on rows containing '&', which is a special character in XML.

    select convert(xml, '<?xml version="1.0" encoding="UTF-8"?>' + convert(varchar(max), cast('' as xml).value('xs:base64Binary(sql:column("body"))', 'varbinary(max)')))

The following query handles the above issue with a replace statement, and I can see all the rows as expected. But this solution only handles the '&' characters.

Example code to run:

    declare @t table ( [body] nvarchar(max) ) 
    
    insert into @t(body) 
    select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA=='
    
    select convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))
        , convert(xml, '<?xml version="1.0" encoding="UTF-8"?>'+ replace(convert(varchar(max),convert(varchar(max),cast('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)'))),'&','&amp;')) 
    from @t

The question: will I have to add more replace statements for other XML special characters, such as < and >?
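For reference, XML character data has exactly three characters that must be escaped (quotes only matter inside attribute values), which Python's xml.sax.saxutils can confirm:

```python
from xml.sax.saxutils import escape

# Only '&', '<' and '>' need escaping in XML character data;
# quotes are only special inside attribute values.
sample = 'Covey & Park <Energy> "Holdings"'
print(escape(sample))  # Covey &amp; Park &lt;Energy&gt; "Holdings"
```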

J Sidhu
  • Possible duplicate of [Convert text value in SQL Server from UTF8 to ISO 8859-1](https://stackoverflow.com/questions/28168055/convert-text-value-in-sql-server-from-utf8-to-iso-8859-1) – GSerg May 17 '19 at 08:03
  • @GSerg, That's a great workaround if performance does not matter that much. Thx for the link! I tried the function with the given string and got back `DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park” `, which looks pretty convincing. – Shnugo May 17 '19 at 08:22
  • @Shnugo None of the workarounds (and they all are) are going to be performant. The fastest of the workarounds *should* be passing `varbinary(max)` to a CLR function and calling `Utf8.GetString` on it. The fastest in principle should be `cast(cast(varbinary as varchar(max)) collate LATIN1_GENERAL_100_CI_AS_SC_UTF8 as nvarchar(max))` in SQL Server 2019. – GSerg May 17 '19 at 08:32
  • @GSerg True... But the fastest was, to know all this and therefore avoid UTF-8 in database storage entirely. I've just added some lines to my answer to reflect this. – Shnugo May 17 '19 at 08:34
  • GSerg and Snugo, thanks for your explanation. Learned something new as well. Adding the cdata block converted all the records in my table correctly - for now pitfalls of xml are not in my path, will see how it goes as I continue to pull more data, yes I need to read up on how cdata works. – J Sidhu May 17 '19 at 16:02

3 Answers

5

The XML trick works fine; just let the XML engine handle the character entities:

    declare @t table ([body] nvarchar(max));

    insert into @t(body) 
    values ('REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==');

    select
        cast(
            cast('<?xml version="1.0" encoding="UTF-8"?><root><![CDATA[' as varbinary(max))
            +
            CAST('' as xml).value('xs:base64Binary(sql:column("body"))', 'VARBINARY(MAX)')
            +
            cast(']]></root>' as varbinary(max))
        as xml).value('.', 'nvarchar(max)')
    from @t;

The important parts here are:

  • The absence of N in front of the string literals
  • The encoding="UTF-8"
  • The fact that we know that the characters from the XML declaration element have the same UTF-8 representation as they do in latin1, so casting them to varbinary gives valid UTF-8
  • The <![CDATA[ ... ]]> block.

Note that it is still no more than a hack. As soon as you involve XML you are subject to XML's limitations, and if your string contains characters not representable in XML, this type of XML conversion will fail with

XML parsing: line 1, character 54, illegal xml character
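That failure is a property of XML 1.0 itself, not of SQL Server; any conforming parser rejects such control characters (a Python sketch):

```python
import xml.etree.ElementTree as ET

# char(8) (backspace) is not representable in XML 1.0, even inside CDATA.
doc = '<root><![CDATA[bad \x08 byte]]></root>'
try:
    ET.fromstring(doc)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)  # False
```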

GSerg
  • This is great, +1 form my side. I added an update section to my answer where I use this approach. Telling people for years, that SQL-Server cannot read utf-8 seems to be wrong :-) – Shnugo May 17 '19 at 09:58
  • @Shnugo It is still a hack. There are characters that are [not representable in XML](https://stackoverflow.com/a/5110103/11683), even in CDATA sections. This code will fail if the string contains e.g. `char(8)` or `char(11)`. – GSerg May 17 '19 at 10:10
  • Yes, I know that there are certain characters left... But I think, that 99,9% of all the utf-8 strings around would work. Btw: [I once placed an answer](https://stackoverflow.com/a/38117491/5089204) how you can include this invalid characters into XML... But this is rather academic... – Shnugo May 17 '19 at 10:15
  • @Shnugo Apparently this also messes with the whitespace (compresses multiple spaces into one, replaces CrLf with Lf etc). Applying `xml:space="preserve"` to `root` does not seem to fix it as the problem I believe happens at the inner `CAST('' as xml).value(...)`. – GSerg Oct 28 '20 at 11:21
  • you need to replace the `]]>` string with `]]]]><![CDATA[>`, otherwise this will not work universally – andowero May 23 '21 at 11:02
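The point of that last comment can be checked mechanically: a ]]> inside the payload would close the CDATA section early, so it has to be split across two adjacent sections (a Python sketch of the rule):

```python
import xml.etree.ElementTree as ET

def cdata_wrap(text):
    # ']]>' would terminate the CDATA section early,
    # so split it across two adjacent sections.
    return '<![CDATA[' + text.replace(']]>', ']]]]><![CDATA[>') + ']]>'

payload = 'contains a ]]> terminator'
doc = '<root>' + cdata_wrap(payload) + '</root>'
print(ET.fromstring(doc).text)  # contains a ]]> terminator
```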
4

UPDATE: I just learned something new, which is - uhm - great :-)

Try this function:

    CREATE FUNCTION dbo.Convert_utf8(@utf8 VARBINARY(MAX))
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @rslt NVARCHAR(MAX);

        SELECT @rslt=
        CAST(
              --'<?xml version="1.0" encoding="UTF-8"?><![CDATA['
              0x3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D225554462D38223F3E3C215B43444154415B
              --the content goes within CDATA
            + @utf8
            --']]>'
            + 0x5D5D3E
        AS XML).value('.', 'nvarchar(max)');

        RETURN @rslt;
    END
    GO
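As a sanity check (sketched in Python, since the byte values are language-agnostic), the two hex literals in the function decode to exactly the XML prolog and the CDATA delimiters:

```python
# The varbinary literals are just the ASCII bytes of the
# prolog + CDATA opener, and of the CDATA closer.
prefix = bytes.fromhex(
    '3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D22'
    '5554462D38223F3E3C215B43444154415B')
suffix = bytes.fromhex('5D5D3E')
print(prefix.decode('ascii'))  # <?xml version="1.0" encoding="UTF-8"?><![CDATA[
print(suffix.decode('ascii'))  # ]]>
```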

And call it like this:

    SELECT *
          ,dbo.Convert_utf8(CAST(t.body AS XML).value('.','varbinary(max)'))
    FROM @t t;

The result is:

DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park” 

GSerg, thank you very much for your answer below! I tried things out and simplified this to work within a UDF.

It looks as if the cast of varbinary(max) to XML is done entirely in a CLR environment, where the XML's encoding declaration is taken into account. This seems to work with other encodings too, but I don't have the time right now to test this generically.

Now the rest of the answer

It contains some background about string encoding, which might be worth reading.

I simplified your code a bit:

    declare @t table ( [body] nvarchar(max) ) 

    insert into @t(body) 
    select 'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kgSG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==';

    SELECT  CAST(t.body AS XML).value('.','varbinary(max)')
           ,CAST(CAST(t.body AS XML).value('.','varbinary(max)') AS VARCHAR(MAX))
    FROM @t t;

You will see this result:

0x44414C4C41532C20545820E28093204D617920372C203230313920E2809320436F7665792026205061726B20456E6572677920486F6C64696E6773204C4C432028E2809C436F766579205061726BE2809D20  
DALLAS, TX – May 7, 2019 – Covey & Park Energy Holdings LLC (“Covey Park†

I'll lay out the first characters in a more readable form:

0x44414C4C41532C20545820E28093  
   D A L L A S ,   T X   â € “ 

The 0x44 is the D, 0x4C twice is the doubled LL, and after the space 0x20 we reach E28093. This is the 3-byte UTF-8 encoding of the en dash's code point. SQL-Server will not help you with this... It will interpret this as 3 characters of 1 byte each...
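The byte arithmetic is easy to verify (a Python sketch):

```python
# One code point (the en dash), three UTF-8 bytes ...
assert '–'.encode('utf-8') == bytes.fromhex('E28093')

# ... which a one-byte codepage reads as three separate characters.
print(bytes.fromhex('E28093').decode('cp1252'))  # â€“
```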

I'm afraid you're out of luck...

SQL-Server does not support UTF-8 strings. There is limited support via BCP / BULK INSERT for input from the file system, but a string within T-SQL must be one of the two supported options:

  • (var)char, which is extended ASCII. It is strictly one-byte-per-character and will need a collation to deal with a limited set of foreign characters.
  • n(var)char, which is UCS-2 (very similar to UTF-16). It is strictly two-bytes-per-character and will encode (almost) any known character at the price of doubled size in memory.

UTF-8 is compatible with (var)char as long as we stick to plain Latin and one-byte codes. But any code above 127 will lead to trouble (it might work with the right collation). And - this is your case here - your string uses multi-byte code points. UTF-8 encodes a lot of characters with two or even more bytes (up to 4!) for one single character.
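A quick illustration of the variable widths (a Python sketch):

```python
# UTF-8 spends between 1 and 4 bytes per code point.
for ch in ('A', 'é', '–', '😀'):
    print(ch, len(ch.encode('utf-8')))
# prints: A 1, é 2, – 3, 😀 4
```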

What you can do

You will have to use some engine capable of dealing with UTF-8:

  • a CLR-function
  • Export to a file and re-import using the limited support (needs v2014 SP2 or higher)
  • Use an external tool (PowerShell, C#, any programming language you know)

And - thx to @GSerg - two more options:

  • Wait for v2019. There will be special collations allowing for native support of utf-8 in T-SQL-strings
  • This answer provides a UDF which can transform UTF-8 to NVARCHAR. It won't be fast, but it works.

General remark

A database can hold storage data just as it is, or working data that you want to use in one way or another. Storing a picture as VARBINARY(MAX) is just a chunk of bits. You would not try to use SQL-Server to perform image recognition.

This is the same with text data. If you just store a chunk of text, it won't matter how you do this. But if you want to use this text for filtering or searching, or if you want SQL-Server to display it, you must think about the format and about performance needs.

An encoding with variable byte lengths will not allow a simple SUBSTRING('blahblah', 2, 3). With a fixed length, the engine can treat the string as an array, jump to the second index, and pick the next three characters. But with variable-length bytes, the engine has to compute the index by scanning all preceding characters in case there is a multi-byte code point. This slows down a lot of string methods extremely...
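The effect can be mimicked in a few lines (a Python sketch; SQL-Server's internals are of course not Python, so this only mirrors the cost model):

```python
s = 'a–b'                 # 3 characters ...
b = s.encode('utf-8')     # ... but 5 bytes: the en dash takes 3
print(len(s), len(b))     # 3 5

# Jumping to a fixed byte offset (what a fixed-width engine could do)
# can land in the middle of a multi-byte sequence:
print(b[:2].decode('utf-8', errors='replace'))  # 'a' plus a broken byte
```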

Best would be not to store data in a format that SQL-Server cannot handle (well)...

Shnugo
  • 4. Wait for SQL Server 2019 than [supports UTF-8 in varchars](https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sqlallproducts-allversions#utf-8-support). – GSerg May 17 '19 at 07:51
  • @GSerg Yes, but you will have to use special collations for this feature. Quite a drawback... And it will be very slow in any filter or search action. The best will be: try to avoid UTF-8 in databases... – Shnugo May 17 '19 at 08:01
  • I'm not very happy with your latest edit. `substring` works in characters, not bytes, and that is not going to change with UTF8. UTF16, which has been supported for a long while, is also a variable length encoding (from 2 to 4 bytes per character), and that has not been causing problems either. – GSerg May 17 '19 at 08:39
  • @GSerg I'm pretty sure, that under the hood a `substring` is reading from an array. If this is *plain data* this will work with memory offsets lightning fast (probably pointer arithmetics). But with UTF16 or UTF8 the engine will have to transform this either into `NVARCHAR` with some *fancy extras* or some kind of character collection in memory (probably a linked list). Do you know details about the actual implementation of `substring()`? Would be interesting... – Shnugo May 17 '19 at 08:48
  • You are probably right about that. With `declare @s nvarchar(20) = N''`, which takes two UTF-16 code units (four bytes), `substring(@s, 1, 1)` returns the first surrogate, not the entire character. But that is the behaviour that it had for ages, so it's not going to become any worse with UTF-8. – GSerg May 17 '19 at 08:55
  • @GSerg, no, it won't become any worse, but it will return unexpected results. Or it will get slow, if a future SQL-Server changes this behaviour due to the collation. – Shnugo May 17 '19 at 10:04
1

If you had SQL Server 2019, you could create another database with UTF-8 as the default collation and create a simple function there:

    USE UTF8_DATABASE
    GO

    CREATE OR ALTER FUNCTION dbo.VarBinaryToUTF8
      (@UTF8 VARBINARY(MAX))
      RETURNS VARCHAR(MAX)
    AS
    BEGIN
      RETURN CAST(@UTF8 AS VARCHAR(MAX));
    END;

You would then call:

    SELECT
      UTF8_DATABASE.dbo.VarBinaryToUTF8
      (
        CAST('' as xml).value('xs:base64Binary(sql:column("body"))','varbinary(max)')
      )
    FROM
      @t

This works because SQL Server uses the default collation of the specific database for its variables and function return values. You have to store the result in an NVARCHAR or a UTF-8 collated VARCHAR in your non-UTF-8 database.
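Modelled outside SQL Server, the cast in this function amounts to interpreting the varbinary payload as UTF-8 (a Python sketch using the question's sample value):

```python
import base64

# Decode the base64 body, then read the resulting bytes as UTF-8 -
# roughly what the UTF-8-collated cast does with the varbinary value.
raw = base64.b64decode(
    'REFMTEFTLCBUWCDigJMgTWF5IDcsIDIwMTkg4oCTIENvdmV5ICYgUGFyayBFbmVyZ3kg'
    'SG9sZGluZ3MgTExDICjigJxDb3ZleSBQYXJr4oCdIA==')
print(raw.decode('utf-8'))  # DALLAS, TX – May 7, 2019 – ...
```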

andowero