
I am trying to match the SHA1 values of certain data in a table between SQL Server and Snowflake.

I've got the SHA1 for a Latin character in SQL Server in the following way:

select  sys.fn_varbintohexsubstring(0, HASHBYTES('SHA1',cast('á'  as varchar(1))),1,0) 

This returns b753d636f6ee46bb9242d01ff8b61f715e9a88c3

The sha1 function in Snowflake returns a different value for the same character.

select sha1(cast('á' as varchar))
Result - 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d

Note - The datatype in SQL Server is nvarchar, while the datatype in Snowflake is varchar with the default collation. For English characters, the SHA1 values match after casting nvarchar to varchar. However, this is not the case with Latin characters.
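
For reference, the raw bytes being hashed appear to differ between the two systems. A quick check on my setup (results may vary with other collations) -

-- SQL Server: a Latin1 codepage stores 'á' as the single byte 0xE1 in varchar
select cast(cast(N'á' as varchar(1)) as varbinary(1)); -- 0xE1
select cast(N'á' as varbinary(2));                     -- 0xE100 (UTF-16LE nvarchar)

-- Snowflake: strings are UTF-8, so 'á' is the two bytes 0xC3 0xA1
select hex_encode('á');                                -- C3A1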

Is there a way to match SHA1 values for non-English characters? I need to get the value '2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d' in SQL Server 2017 and below, as that is what other databases such as Oracle, Snowflake and Hive return.

Thanks

  • What is the default collation of the database where you run this? Or column collation, which you read it from, whichever the case. – Roger Wolf Aug 30 '21 at 14:38
  • The result depends on collation and data type: [db<>fiddle demo](https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c8d764690284cfb548551249631a2e70) – Lukasz Szozda Aug 30 '21 at 17:46
  • Default collation for the database as well as the said column is SQL_Latin1_General_CP1_CI_AS in SQL Server with datatype nvarchar, while the collation in Snowflake is utf8 with datatype varchar – Happygolucky Aug 31 '21 at 06:42
  • select sys.fn_varbintohexsubstring(0, HASHBYTES('SHA1',cast(N'á' COLLATE LATIN1_GENERAL_100_CS_AS_SC_UTF8 as varchar(100))),1,0) works fine on dbfiddle. However, SQL Server 2017 does not seem to support this UTF-8 collation. Are there any alternatives here? – Happygolucky Aug 31 '21 at 07:47

2 Answers


TL;DR: Never use varchar when calculating hashes. There are simply too many rakes you can step on in the process.

Just as an example, I adapted your code for easier understanding and ran it in the context of a database with the Latin1_General_100_CI_AS default collation:

declare @a nchar(1) = N'á';
declare @b char(1) = cast(@a as char(1));

select @b as [Char], ascii(@b) as [A], unicode(@b) as [U], HASHBYTES('SHA1',@b) as [Hash]
union all
select @a, ascii(@a), unicode(@a), HASHBYTES('SHA1',@a);

The result is:

Char    A    U Hash
---- ---- ---- ------------------------------------------
á     225  225 0xB753D636F6EE46BB9242D01FF8B61F715E9A88C3
á     225  225 0xA4BCF633D5ECCD3F2A55CD0AD3D109A108A45F02

However, if I change the database context to another DB, with the Cyrillic_General_100_CI_AS collation, the same code suddenly returns different values:

Char    A    U Hash
---- ---- ---- ------------------------------------------
a      97   97 0x86F7E437FAA5A7FCE15D1DDCB9EAEAEA377667B8
á      97  225 0xA4BCF633D5ECCD3F2A55CD0AD3D109A108A45F02

As you can see, the [Char] in the first line is a different character now (small Latin "а"). This kind of implicit codepage adjustment cannot be prevented unless your data is in Unicode, or in a binary form.
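
If you want to see exactly what HASHBYTES receives, casting the same variables to varbinary exposes the underlying bytes (a quick check; the char bytes depend on the database's default collation):

-- nchar is always UTF-16LE; char holds whatever byte the database codepage maps the character to
select cast(@a as varbinary(2)) as [UnicodeBytes],  -- 0xE100 in both databases
       cast(@b as varbinary(1)) as [CodepageBytes]; -- 0xE1 under Latin1, 0x61 under Cyrillic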


Your options

  1. Upgrade to MS SQL Server 2019, or move to Azure SQL Database. Starting from this version you can actually store strings in UTF-8 encoding (a sketch follows this list), although you'll probably get a performance hit for that (whether it'll be noticeable or not depends on your usage patterns).
  2. Calculate hashes externally (meaning, not in SQL). You can write a CLR function in C#, or something similar in Java (see Elliott Brossard's answer). This will increase the complexity of your solution, and putting external code in your database might not be allowed by your company's policies, for example. Plus, maintaining external assemblies is usually a hassle.
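
For completeness, the UTF-8 collation route from option 1 (also shown in the comments under the question) would look roughly like this; it requires SQL Server 2019 or later and is not available on 2017 and below:

-- SQL Server 2019+ only: a UTF-8 collation makes the varchar cast produce UTF-8 bytes
select sys.fn_varbintohexsubstring(0,
       HASHBYTES('SHA1',
           cast(N'á' collate Latin1_General_100_CS_AS_SC_UTF8 as varchar(100))),
       1, 0);
-- expected: 2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d, per the dbfiddle linked in the comments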
Roger Wolf
  • Thank you Roger for the explanation. In this case, I am trying to compare the hash values of **á** across both databases and to understand why they give different hash results. I am looking for a way to match these values, whether by changing the collation/datatype etc. My use case requires matching the hash values to establish that the data is the same. – Happygolucky Aug 31 '21 at 06:56
  • Convert them both to `nvarchar` *before* you hash them, @Happygolucky. – Thom A Aug 31 '21 at 08:02
  • @Larnu Snowflake does not support nvarchar; the equivalent of this datatype in Snowflake is varchar – Happygolucky Aug 31 '21 at 09:09
  • @Happygolucky, Snowflake docs state that `VARCHAR holds Unicode characters.`, so the only issue I see here is that it stores strings in UTF-8. Unless you have SQL Server 2019 or the latest Azure equivalent, you can't use the same encoding for strings on this side. I'd recommend understanding what they call collations; it seems to be a slightly different concept - https://docs.snowflake.com/en/sql-reference/collation.html – Roger Wolf Aug 31 '21 at 09:30
  • @RogerWolf yes, I tried some permutations with different collations - `select sha1(cast('á' collate 'en' as varchar))`. It seems as though changing the collation does not affect the hash output in Snowflake. All of them seem to give the same output. – Happygolucky Aug 31 '21 at 10:00
  • @RogerWolf Option 1 is not feasible as of now. Option 2, given in Elliott Brossard's answer, returns the same output as SQL Server. However, since my use case requires pushing the computation to the database and just comparing the results, it will increase the complexity of the solution, as you rightly guessed. – Happygolucky Sep 01 '21 at 05:30
  • @Happygolucky, it takes a bit more to convert strings to UTF-8 in C#, as natively they are also stored in UTF-16 in .NET. However, there are plenty of examples on SO of how to do that. And if you make a CLR function, it'll be executed inside SQL Server (it can be called directly from your SQL code). – Roger Wolf Sep 01 '21 at 10:14

You can compute SHA1 hashes of Latin-1 strings using a Java UDF. Here is an example:

create function latin1sha1(str varchar)
returns varbinary language java handler = 'Latin1Sha1.compute' as $$
import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Latin1Sha1 {
  public byte[] compute(String str) throws NoSuchAlgorithmException, UnsupportedEncodingException {
    MessageDigest hash = MessageDigest.getInstance("SHA-1");
    hash.update(str.getBytes("ISO-8859-1"));  // AKA Latin-1
    return hash.digest();
  }
}
$$;

select hex_encode(latin1sha1('á'));

This returns B753D636F6EE46BB9242D01FF8B61F715E9A88C3.
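
If the goal is to compare entire tables rather than a single literal, the UDF can be applied to a column directly; my_table and str_col below are placeholder names, and lower() just makes the output comparable to the lowercase hex produced on the SQL Server side:

-- my_table / str_col are hypothetical
select str_col, lower(hex_encode(latin1sha1(str_col))) as sha1_hex
from my_table;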

Elliott Brossard
  • Doesn't look like you convert it to UTF-8. I receive this hash in SQL Server from `nvarchar` data. – Roger Wolf Sep 01 '21 at 04:56
  • @ElliottBrossard Yes, this is the value SQL Server 2017 and below versions return. I need to get the hash value `2b9cc8d86a48fd3e4e76e117b1bd08884ec9691d` which is what Snowflake, Oracle and Hive SQL queries return as the SHA1 output for the same `á`. The use case is to match the data between the databases by matching the hash values. – Happygolucky Sep 01 '21 at 05:23
  • Maybe I misunderstood what you were asking. I thought your goal was to get the same hash value in Snowflake as you are seeing in SQL Server. – Elliott Brossard Sep 01 '21 at 16:56
  • @ElliottBrossard. Ah, got it. The description was not very specific on the ask. Updated the same! Thanks. – Happygolucky Sep 03 '21 at 14:16