Is there a simple way to convert a UTF-8 encoded varbinary(max) column to varchar(max) in T-SQL? Something like `CONVERT(varchar(max), [MyDataColumn])`. Best would be a solution that does not need custom functions.
Currently, I convert the data on the client side, but this has the downside that filtering and sorting cannot be done as efficiently as on the server.
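For context, the client-side conversion mentioned here is typically a one-liner; a minimal illustrative sketch (Python here, with a hard-coded byte sample standing in for the fetched varbinary value):

```python
# Client-side decoding of the VARBINARY payload: the driver returns raw
# bytes, which application code decodes as UTF-8.
raw = bytes([0xC5, 0xBD, 0x6C, 0x75, 0xC5, 0xA5])  # UTF-8 bytes for 'Žluť'
text = raw.decode('utf-8')
```

This works, but as noted, the server then only ever sees opaque bytes, so it cannot filter or sort on the decoded text.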

- *Don't* use a binary field to store text. Trying to convert UTF-8 to ASCII/a single-byte codepage (which is what varchar is) will always fail unless the string was *already* in the US-ASCII codepage – Panagiotis Kanavos Feb 11 '19 at 08:24
- I believe you have a varchar value stored as a varbinary? If not, the result of the varchar will be gibberish. – EzLo Feb 11 '19 at 08:24
- The simplest solution is to store Unicode text in Unicode types: `nvarchar(...)` for simple text, `nvarchar(max)` for CLOBs. No functions or conversions necessary, no chance of codepage conversion errors. Your client code libraries or drivers will be able to just store the Unicode text without complications – Panagiotis Kanavos Feb 11 '19 at 08:25
- What version of SQL Server are you running? Since UTF-8 support was only added in the 2019 version, it's highly likely you're running a version of SQL Server that *doesn't know anything about UTF-8*. – Damien_The_Unbeliever Feb 11 '19 at 08:27
- @Damien_The_Unbeliever *will be added*. 2019 isn't out yet and the current previews still have incomplete UTF-8 support. I wonder why the OP wants UTF-8. An attempt to migrate another database that doesn't have `nvarchar`? Or just an assumption that Unicode means UTF-8? – Panagiotis Kanavos Feb 11 '19 at 08:30
- @sschoenb why do you want to store Unicode text as binary anyway? This doesn't offer any advantages while introducing a *lot* of problems. What problem are you trying to solve? – Panagiotis Kanavos Feb 11 '19 at 09:28
- It was not my choice to store UTF-8 content as varbinary; a third party stores JSON data as UTF-8 varbinary. – Stefan Schönbächler Feb 11 '19 at 09:33
- @sschoenb that's rather ... unfortunate, as it prevents you from using SQL Server's JSON functions too. The Big 3 databases (DB2, Oracle, SQL Server) have always supported `nchar`/`nvarchar`, which suggests that third party started from MySQL/PostgreSQL instead. – Panagiotis Kanavos Feb 11 '19 at 09:50
- @sschoenb what you can do is create another `nvarchar` field that gets updated, e.g. by a trigger, to hold the UTF-16 text. This will allow you to apply JSON functions, index it, specify collations etc. You can use transparent compression on the table (available even in SQL Server Express since 2016 SP1) to reduce space and actually improve performance by reducing IO – Panagiotis Kanavos Feb 11 '19 at 09:53
3 Answers
XML trick
The following solution should work for any encoding.
There is a tricky way of doing exactly what the OP asks. Edit: I found the same method discussed on SO (SQL - UTF-8 to varchar/nvarchar Encoding issue).
The process goes like this:
```sql
SELECT
    CAST(
        '<?xml version=''1.0'' encoding=''utf-8''?><![CDATA[' --start CDATA
        + REPLACE(
            LB.LongBinary,
            ']]>',            --we only need to escape ]]>, which ends a CDATA section
            ']]]]><![CDATA[>' --we simply split it into two CDATA sections
        ) + ']]>' AS XML      --finish CDATA
    ).value('.', 'nvarchar(max)')
FROM
    dbo.MyTable AS LB;        --illustrative name: the table holding the varbinary column
```
Why it works: varbinary and varchar are the same string of bits - only the interpretation differs. The resulting XML therefore truly is a UTF-8 encoded byte stream, and the XML parser is then able to reconstruct the correct UTF-8 encoded characters.

BEWARE the `nvarchar(max)` in the `value` function. If you used `varchar`, it would destroy multi-byte characters (depending on your collation).

BEWARE 2: XML cannot handle some characters, e.g. 0x2. When your string contains such characters, this trick will fail.
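The CDATA-splitting step is the only subtle part; it can be verified outside SQL Server. A small Python sketch (purely illustrative) mirrors the `REPLACE` and checks that an XML parser reassembles the original UTF-8 text:

```python
import xml.etree.ElementTree as ET

def cdata_escape(payload: bytes) -> bytes:
    # Mirror of the REPLACE in the T-SQL trick: the only sequence that can
    # terminate a CDATA section is ']]>', so split it across two sections.
    return payload.replace(b']]>', b']]]]><![CDATA[>')

original = 'Žluťoučký kůň, plus a tricky ]]> sequence'.encode('utf-8')
doc = (b"<?xml version='1.0' encoding='utf-8'?><r><![CDATA["
       + cdata_escape(original) + b"]]></r>")
decoded = ET.fromstring(doc).text  # the parser merges the adjacent CDATA sections
assert decoded == original.decode('utf-8')
```

The same reassembly happens inside SQL Server's XML engine when `.value('.', 'nvarchar(max)')` extracts the text node.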
Database trick (SQL Server 2019 and newer)
This is simple. Create another database with a UTF8 collation as the default. Create a function that converts `VARBINARY` to `VARCHAR`. The returned `VARCHAR` will have that UTF8 collation of the database.
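A minimal sketch of this trick; the database and collation names are illustrative, and the function name mirrors the `FUNCTIONS_ONLY.dbo.VarBinaryToUTF8` used in the benchmark script later on:

```sql
-- Sketch only (SQL Server 2019+); names are illustrative.
CREATE DATABASE FUNCTIONS_ONLY COLLATE Czech_100_CI_AI_SC_UTF8;
GO
USE FUNCTIONS_ONLY;
GO
CREATE FUNCTION dbo.VarBinaryToUTF8 (@Bin VARBINARY(MAX))
RETURNS VARCHAR(MAX)
AS
BEGIN
    -- The returned VARCHAR picks up the database's default UTF8
    -- collation, so the bytes are reinterpreted as UTF-8 text.
    RETURN CAST(@Bin AS VARCHAR(MAX));
END;
```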
Insert trick (SQL Server 2019 and newer)
This is another simple trick. Create a table with one `VARCHAR COLLATE ...UTF8` column. Insert the `VARBINARY` data into this table. It will get saved correctly as UTF8 `VARCHAR`. It is sad that memory-optimized tables cannot use UTF8 collations...
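A hedged sketch of the insert trick, reusing the column name from the question; the temp table, source table, and collation are illustrative:

```sql
-- Sketch only (SQL Server 2019+); the implicit VARBINARY -> VARCHAR
-- conversion uses the target column's UTF8 collation, so the bytes
-- are decoded as UTF-8.
CREATE TABLE #Utf8 (Txt VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);

INSERT INTO #Utf8 (Txt)
SELECT MyDataColumn        -- VARBINARY(MAX) holding UTF-8 bytes
FROM dbo.MyTable;          -- illustrative source table

SELECT Txt FROM #Utf8;     -- correctly decoded text
```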
Alter table trick (SQL Server 2019 and newer)
(don't use this, it is unnecessary - see the Insert trick above)
I was trying to come up with an approach using SQL Server 2019's UTF8 collation and I have found one possible method so far that should be faster than the XML trick (see below).
- Create temporary table with varbinary column.
- Insert varbinary values into the table
- Alter table alter column to varchar with utf8 collation
```sql
drop table if exists
    #bin,
    #utf8;

create table #utf8 (UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8);
create table #bin  (BIN VARBINARY(MAX));

insert into #utf8 (UTF8) values ('Žluťoučký kůň říčně pěl ďábelské ódy za svitu měsíce.');
insert into #bin  (BIN) select CAST(UTF8 AS varbinary(max)) from #utf8;

select * from #utf8;                             --here you can see the utf8 string is stored correctly
select BIN, CAST(BIN AS VARCHAR(MAX)) from #bin; --utf8 binary is converted into gibberish

alter table #bin alter column BIN varchar(max) collate Czech_100_CI_AI_SC_UTF8;
select * from #bin; --voilà, correctly converted varchar

alter table #bin alter column BIN nvarchar(max);
select * from #bin; --finally, correctly converted nvarchar
```
Speed difference
- The Database trick together with the Insert trick are the fastest ones.
- The XML trick is slower.
- The Alter table trick is stupid, don't do it. It loses out when you convert lots of short texts at once (the altered table is large).

The test:
- the first string contains one replace for the XML trick
- the second string is plain ASCII with no replaces for the XML trick
- `@TextLengthMultiplier` determines the length of the converted text; `@TextAmount` determines how many of them will be converted at once
```sql
------------------
--TEST SETUP: uncomment one of the two test strings
--DECLARE @LongText NVARCHAR(MAX) = N'český jazyk, Tiếng Việt, русский язык, 漢語, ]]>';
--DECLARE @LongText NVARCHAR(MAX) = N'JUST ASCII, for LOLZ------------------------------------------------------';
DECLARE
    @TextLengthMultiplier INTEGER = 100000,
    @TextAmount           INTEGER = 10;

---------------------
-- TECHNICALITIES
DECLARE
    @StartCDATA  DATETIME2(7), @EndCDATA  DATETIME2(7),
    @StartTable  DATETIME2(7), @EndTable  DATETIME2(7),
    @StartDB     DATETIME2(7), @EndDB     DATETIME2(7),
    @StartInsert DATETIME2(7), @EndInsert DATETIME2(7);

drop table if exists
    #longTexts,
    #longBinaries,
    #CDATAConverts,
    #DBConverts,
    #InsertConverts;

CREATE TABLE #longTexts      (LongText   VARCHAR  (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #longBinaries   (LongBinary VARBINARY(MAX) NOT NULL);
CREATE TABLE #CDATAConverts  (LongText   VARCHAR  (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #DBConverts     (LongText   VARCHAR  (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);
CREATE TABLE #InsertConverts (LongText   VARCHAR  (MAX) COLLATE Czech_100_CI_AI_SC_UTF8 NOT NULL);

insert into #longTexts --make the long text longer
    (LongText)
select
    REPLICATE(@LongText, @TextLengthMultiplier)
from
    TESTES.dbo.Numbers --use a WHILE loop if you don't have a numbers table
WHERE
    Number BETWEEN 1 AND @TextAmount; --make more of them

insert into #longBinaries (LongBinary) select CAST(LongText AS varbinary(max)) from #longTexts;

--sanity check...
SELECT TOP(1) * FROM #longTexts;

------------------------------
--MEASURE CDATA--
SET @StartCDATA = SYSDATETIME();

INSERT INTO #CDATAConverts
    (LongText)
SELECT
    CAST(
        '<?xml version=''1.0'' encoding=''utf-8''?><![CDATA['
        + REPLACE(
            LB.LongBinary,
            ']]>',
            ']]]]><![CDATA[>'
        ) + ']]>' AS XML
    ).value('.', 'nvarchar(max)')
FROM
    #longBinaries AS LB;

SET @EndCDATA = SYSDATETIME();

--------------------------------------------
--MEASURE ALTER TABLE--
SET @StartTable = SYSDATETIME();

DROP TABLE IF EXISTS #AlterConverts;
CREATE TABLE #AlterConverts (UTF8 VARBINARY(MAX));

INSERT INTO #AlterConverts
    (UTF8)
SELECT
    LB.LongBinary
FROM
    #longBinaries AS LB;

ALTER TABLE #AlterConverts ALTER COLUMN UTF8 VARCHAR(MAX) COLLATE Czech_100_CI_AI_SC_UTF8;
--ALTER TABLE #AlterConverts ALTER COLUMN UTF8 NVARCHAR(MAX);

SET @EndTable = SYSDATETIME();

--------------------------------------------
--MEASURE DB--
SET @StartDB = SYSDATETIME();

INSERT INTO #DBConverts
    (LongText)
SELECT
    FUNCTIONS_ONLY.dbo.VarBinaryToUTF8(LB.LongBinary)
FROM
    #longBinaries AS LB;

SET @EndDB = SYSDATETIME();

--------------------------------------------
--MEASURE INSERT--
SET @StartInsert = SYSDATETIME();

INSERT INTO #InsertConverts
    (LongText)
SELECT
    LB.LongBinary
FROM
    #longBinaries AS LB;

SET @EndInsert = SYSDATETIME();

--------------------------------------------
-- RESULTS
SELECT
    DATEDIFF(MILLISECOND, @StartCDATA,  @EndCDATA)  AS CDATA_MS,
    DATEDIFF(MILLISECOND, @StartTable,  @EndTable)  AS ALTER_MS,
    DATEDIFF(MILLISECOND, @StartDB,     @EndDB)     AS DB_MS,
    DATEDIFF(MILLISECOND, @StartInsert, @EndInsert) AS Insert_MS;

SELECT TOP(1) '#CDATAConverts ', * FROM #CDATAConverts;
SELECT TOP(1) '#DBConverts    ', * FROM #DBConverts;
SELECT TOP(1) '#InsertConverts', * FROM #InsertConverts;
SELECT TOP(1) '#AlterConverts ', * FROM #AlterConverts;
```

- So, I have dug around a little bit more and came up with 3 more solutions. I even sped up the famous XML trick. My advice would be: use a CLR function... – andowero May 23 '21 at 10:39
- Nice idea, but it seems to have problems with invalid characters. I receive Msg 9420 - invalid XML character at position ... – Herbert Jul 27 '22 at 15:23
- @Herbert: Could you provide an example? Try to find the offending character. Did you try the Insert trick or the Database trick? – andowero Jul 28 '22 at 07:09
- I just used the XML trick "natively" as in the code sample above. In this case it is a UTF-8 coded text which is stored as image. Converting it the standard way, like `CONVERT(varchar(5000), CAST([Text_ASCII_7] AS varbinary(max)), 0) AS Titel_Text`, does work as expected: it delivers a UTF-8 encoded character string in ASCII format with all single bytes as characters. I can't check for invalid characters as I have to process a large number of ever-changing strings. – Herbert Jul 29 '22 at 11:09
- Well, if your document contains characters illegal in XML, then the XML trick needs to be amended. As per [this answer](https://stackoverflow.com/a/21098537/3434168), there are multiple characters that just cannot be in an XML document. I would suggest the "insert" trick if you cannot create a new database, or the "database" trick if you can. – andowero Aug 02 '22 at 13:36
SQL Server does not know UTF-8 (at least not in the versions you can use productively). There is limited support starting with v2014 SP2 (see the details about the supported versions) when reading a `utf-8` encoded file from disk via `BCP` (same for writing content to disk).

Important to know:

- `VARCHAR(x)` is not `utf-8`. It is 1-byte-encoded extended ASCII, using a codepage (living in the collation) as the character map.
- `NVARCHAR(x)` is not `utf-16` (but very close to it; it's `ucs-2`). This is a 2-byte-encoded string covering almost any known character (but exceptions exist).
- `utf-8` will use 1 byte for plain latin characters, but 2 or even more bytes to encode foreign charsets.
- A `VARBINARY(x)` will hold the `utf-8` as a meaningless chain of bytes.

A simple `CAST` or `CONVERT` will not work: `VARCHAR` will take each single byte as a character. For sure this is not the result you would expect. `NVARCHAR` would take each chunk of 2 bytes as one character. Again, not the thing you need.

You might try to write this out to a file and read it back with `BCP` (v2014 SP2 or higher). But the better chance I see for you is a CLR function.
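These byte-interpretation points can be illustrated outside SQL Server; a small Python sketch, with Latin-1 standing in for an arbitrary single-byte `VARCHAR` codepage:

```python
text = 'Žluťoučký kůň'
utf8 = text.encode('utf-8')

# utf-8 width varies: 1 byte for plain latin characters, more for others.
assert len('a'.encode('utf-8')) == 1
assert len('Ž'.encode('utf-8')) == 2
assert len('漢'.encode('utf-8')) == 3

# What a plain CAST to VARCHAR does: each byte is read as one
# single-byte-codepage character -> mojibake, not the original text.
as_varchar = utf8.decode('latin-1')
assert as_varchar != text

# What a CAST to NVARCHAR does: each 2-byte chunk is read as one
# UTF-16 code unit -> different gibberish.
as_nvarchar = utf8.decode('utf-16-le', errors='replace')
assert as_nvarchar != text
```

This is exactly why the varbinary column cannot simply be cast to either string type on pre-2019 servers.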

- [nvarchar is UTF16](https://learn.microsoft.com/en-us/sql/t-sql/data-types/nchar-and-nvarchar-transact-sql?view=sql-server-2017) since SQL Server 2012, as long as a [Supplementary Characters, _SC](https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-2017#Supplementary_Characters) collation is used – Panagiotis Kanavos Feb 11 '19 at 09:26
- @PanagiotisKanavos, yeah... I've stumbled across this already. But I'd assume that hardly anybody is using these `_SC` collations, at least not by default. Changing a database's collation is a really tough action. Adding these collations to each and every string you need is a very ugly and time-consuming process too... Furthermore, I think the docs are not absolutely precise in talking about `NVARCHAR` as a *unicode / utf-16* equivalent. Well, in 99.9% of cases yes, but not every UTF-16 string would survive a cast to `VARBINARY` and back to `NVARCHAR`... – Shnugo Feb 11 '19 at 10:45
- Probably because we haven't encountered any issues, even with emojis. They may not be in UCS-2 but, as the docs explain, they won't get mangled. – Panagiotis Kanavos Feb 11 '19 at 10:52
You can use the following to write a string into a varbinary field:

```csharp
Encoding.Unicode.GetBytes(Item.VALUE)
```

then use the following to retrieve the data as a string:

```csharp
public string ReadCString(byte[] cString)
{
    // Trim at the first NUL terminator, if any, before decoding.
    var nullIndex = Array.IndexOf(cString, (byte)0);
    nullIndex = (nullIndex == -1) ? cString.Length : nullIndex;
    return System.Text.Encoding.Unicode.GetString(cString, 0, nullIndex);
}
```

Note that `Encoding.Unicode` is UTF-16, not UTF-8; for the UTF-8 encoded column in the question, `Encoding.UTF8` would be the matching encoding.
