8

I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# app that saves these records uses the following method to do the hashing:

public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}

In order to call this, I first convert the string to byte[] using the UTF-8 encoding:

// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72

Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:

print hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the UTF-16 representation of the string. I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).

I cannot change the way hashing is done in the application.

Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?

I looked up similar questions, I tried using collations, but had no luck so far.

Cristian Lupascu
  • 39,078
  • 16
  • 100
  • 137
  • I just did kinda the same thing last night .. i guess you use it to store passwords and check for login ... why don't you change your logic to let C# use MD5 and convert it again to hash and then check if it's same string that you stored in your DB? – Veljko89 Feb 04 '16 at 12:03
  • 1
    @Veljko89 MD5 is [not suited](http://security.stackexchange.com/a/19908/4304) to be used for passwords. I recommend that you avoid it. – Cristian Lupascu Feb 04 '16 at 12:05
  • But to actually test it on any website there are defenses against it, timeout after 5 tries or something ... no website can process that amount of logins. And even to find someones password, what are chances to anyone figure out 20 char string added as salt? – Veljko89 Feb 04 '16 at 12:07
  • @Veljko89 Yes, but in case an attacker gets the contents of the DB (via an SQLI vulnerability for example) it would be easy to get (at least some of) the passwords. – Cristian Lupascu Feb 04 '16 at 12:11
  • @Veljko89 I'd like to change the app as a last resort, as that is pretty hard to do. I'm interested if there's an SQL solution to this. – Cristian Lupascu Feb 04 '16 at 12:13
  • I personally don't know how i would do it on SQL side, but if you reach that point that you have to change app check answer on this question http://stackoverflow.com/questions/27908449/tsql-md5-hash-different-to-c-sharp-net-md5 – Veljko89 Feb 04 '16 at 12:21
  • You need to create a UDF to convert the NVARCHAR data to bytes in UTF-8 Representation. I'll dig one out. – Ben Feb 08 '16 at 18:02

3 Answers3

14

You need to create a UDF to convert the NVARCHAR data to bytes in UTF-8 Representation. Say it is called dbo.NCharToUTF8Binary then you can do:

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1))

Here is a UDF which will do that:

create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
-- Note: This is not the fastest possible routine. 
-- If you want a fast routine, use SQLCLR
    set @modified = isnull(@modified, 0)
    -- First shred into a table.
    declare @chars table (
    ix int identity primary key,
    codepoint int,
    utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    while @ix < datalength(@txt)/2  -- trailing spaces
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end

    -- Now look for surrogate pairs.
    -- If we find a pair (lead followed by trail) we will pair them
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate  is \uDC00 to \uDFFF
    -- Look for high surrogate followed by low surrogate and update the codepoint   
    update c1 set codepoint = ((c1.codepoint & 0x07ff) * 0x0800) + (c2.codepoint & 0x07ff) + 0x10000
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0xD800 and c1.codepoint <=0xDBFF
    and c2.codepoint >= 0xDC00 and c2.codepoint <=0xDFFF
    -- Get rid of the trailing half of the pair where found
    delete c2 
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0x10000

    -- Now we utf-8 encode each codepoint.
    -- Lone surrogate halves will still be here
    -- so they will be encoded as if they were not surrogate pairs.
    update c 
    set utf8 = 
    case 
    -- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding)
    when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
    then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
    -- Two-byte encodings
    when codepoint <= 0x07ff
    then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Three-byte encodings
    when codepoint <= 0x0ffff
    then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Four-byte encodings 
    when codepoint <= 0x1FFFFF
    then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)

    end
    from @chars c

    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return  @ret

end
Ben
  • 34,935
  • 6
  • 74
  • 113
2

SQL Server does not natively support using UTF-8 strings, and it hasn't for quite a while. As you noticed, NCHAR and NVARCHAR use UCS-2 rather than UTF-8.

If you are insistent on using the HASHBYTES function, you must be able to pass the UTF-8 byte[] as VARBINARY from your C# code to preserve the encoding. HASHBYTES accepts VARBINARY in place of NVARCHAR. This could be accomplished with a CLR function that accepts NVARCHAR and returns the results of Encoding.UTF8.GetBytes as VARBINARY.

With that being said, I strongly suggest keeping these types of business rules isolated within your application rather than the database. Especially since the application is already performing this logic.

Taylor Buchanan
  • 4,155
  • 1
  • 28
  • 40
2
SELECT HashBytes('MD5', CAST (N'中文' COLLATE Latin1_General_100_CI_AI_SC_UTF8 AS varchar(4000)))

It only on sql server 2019

reference: https://www.mssqltips.com/sqlservertip/6168/impact-of-utf8-support-in-sql-server-2019/

TT.
  • 15,774
  • 6
  • 47
  • 88