0

I'm using both the following methods to encode in base 64 a Chinese string. Problem is that I'm having Pz8= as output, which decoded is ??

What's wrong with this and how can I fix it?

Method 1

CREATE FUNCTION [dbo].[base64Encode] (@input VARCHAR(MAX))
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @output NVARCHAR(MAX),
        @bits VARBINARY(3),
        @pos INT

    SET @pos = 1
    SET @output = ''

    WHILE @pos <= LEN(@input)
    BEGIN
        SET @bits = CONVERT(VARBINARY(3), SUBSTRING(@input, @pos, 3))
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 1, 1) / 4 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 1, 1) % 4 * 16 + SUBSTRING(@bits, 2, 1) / 16 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 2, 1) % 16 * 4 + SUBSTRING(@bits, 3, 1) / 64 + 1, 1)
        SET @output = @output + SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/', SUBSTRING(@bits, 3, 1) % 64 + 1, 1)
        SET @pos = @pos + 3
    END

    RETURN (LEFT(@output, LEN(@output) - 3 + LEN(@bits)) + REPLICATE('=', 3 - LEN(@bits)))
END

SELECT [dbo].[base64Encode]('你好')

Method 2

SELECT CAST('你好' as varbinary(max)) FOR XML PATH(''), BINARY BASE64
CDspace
  • 2,639
  • 18
  • 30
  • 36
Mark
  • 645
  • 2
  • 9
  • 27
  • You're literal string is a `varchar`, not an `nvarchar`. Use `N'你好'`. If you try `SELECT '你好',N'你好';` you'll very quickly see the problem. – Thom A Jun 04 '18 at 15:31

1 Answers1

2

You are missing the N to mark a string literal as unicode:

 SELECT N'你好' AS unicode
       ,'你好' AS ASCII

Try this to get a base64 out of your chinese charcters and vice versa:

SELECT (SELECT CAST(N'你好' AS VARBINARY(MAX)) FOR XML PATH(''),TYPE).value(N'.','nvarchar(max)');

You get this base64 result: YE99WQ==

This is the way to re-convert the base64 to the original value

SELECT CAST(CAST('<x>' + 'YE99WQ==' + '</x>' AS XML).value('.','varbinary(max)') AS NVARCHAR(MAX));

UPDATE Some words about the re-encoding

base64 does not encode a string value, but the binary pattern a system uses to keep that string in memory (this is valid for any data type actually). The bit pattern of a string differs with UTF-8, UTF-16, ASCII whatever... And even worse there is BE and LE.

The steps to get base64 are:

  • Get the bit pattern of my value (a string, a date, a picture, any value actually)
  • compute the base64 for this bit pattern

The steps for the re-encoding are

  • Compute the original bit pattern which is hidden within the base64
  • Interpret the bit pattern as the original value

The very last step might bring up confusion... You have to know exactly which binary representation a system uses. You have to use exactly the same data type with exactly the same interpretation to get the values back.

With strings one has to know, that SQL-Server works with a very limited choice natively.

  • There is NVARCHAR (NCHAR), which is 2-byte encoded unicode in UCS-2 flavour (almost the same as utf-16)
  • And there is VARCHAR (CHAR), which is 1-byte encoded extended ASCII. All non-latin characters are bound to a code page within the connected collation. But this is not UTF-8!
Shnugo
  • 66,100
  • 9
  • 53
  • 114
  • I would add one thing to avoid confusion when some external app will decode values. SQL Server uses UTF-16 [How to decode base64 unicode string using T-SQL](https://stackoverflow.com/questions/25113655/how-to-decode-base64-unicode-string-using-t-sql) – Lukasz Szozda Jun 04 '18 at 16:08
  • @lad2025 Isn't the last part exactly doing this? – Shnugo Jun 04 '18 at 16:57
  • @Shnugo Yes, exactly inside SQL Server everything is OK (encode/decode). But if you encode in SQL Server and decode using for instance https://www.base64decode.org/ you will get different results. – Lukasz Szozda Jun 04 '18 at 17:01
  • 1
    @lad2025 is right, if i try to decode the string i've encoded with your system, with the javscript atob() function, i'm unable to decode it because the output is a string with "\u0000" between each char of the original decoded string – Mark Jun 05 '18 at 09:25
  • @Mark Read my update. Your *`\u0000` between each char* points to a writen *2-byte* string, which is read as *1-byte* string. – Shnugo Jun 05 '18 at 09:44
  • @Shnugo thanks for adding the explaination, but I can't find out how to correctly decode that string with javascript in documentations...do you have any idea? – Mark Jun 05 '18 at 10:01
  • @Mark You are dealing with binary values, not with strings... To avoid *cross-system* issues I'd implement an UDF in T-SQL. You might [read this related question](https://stackoverflow.com/q/2820249/5089204) – Shnugo Jun 05 '18 at 10:59
  • @Shnugo I've already put the code from your answer into an UDF but I can't see how this could help decoding the generated string in javascript... – Mark Jun 05 '18 at 13:23
  • @Mark Can't you call this function from your JS code? Pass in the base64 string and get back the clear text (all in UCS-2)? – Shnugo Jun 05 '18 at 13:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172526/discussion-between-mark-and-shnugo). – Mark Jun 05 '18 at 17:11