
SQL Server's hashing function HASHBYTES has an input limit of 8000 bytes.

How do you hash larger strings?
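For a quick illustration of the limit (it applies through SQL Server 2014; SQL Server 2016 removed it):

-- 4000 NVARCHAR characters = 8000 bytes: exactly at the limit, this works
SELECT HASHBYTES('SHA1', REPLICATE(N'x', 4000));

-- 8002 bytes: raises an error on versions with the limit
SELECT HASHBYTES('SHA1', REPLICATE(CONVERT(NVARCHAR(MAX), N'x'), 4001));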

– SDReyes

3 Answers


You could write a SQL CLR function:

using System.Data.SqlTypes;
using System.Security.Cryptography;
using System.Text;

[Microsoft.SqlServer.Server.SqlFunction]
public static SqlBinary BigHashBytes(SqlString algorithm, SqlString data)
{
    // Resolve the algorithm by name, e.g. "MD5" or "SHA1"
    var algo = HashAlgorithm.Create(algorithm.Value);

    // Note: UTF-8 encoding here; see the comments below about SQL Server's UTF-16 strings
    var bytes = Encoding.UTF8.GetBytes(data.Value);

    return new SqlBinary(algo.ComputeHash(bytes));
}

And then it can be called in SQL like this:

--these return the same value
select HASHBYTES('md5', 'test stuff')
select dbo.BigHashBytes('md5', 'test stuff')

BigHashBytes is only necessary when the input would be longer than 8,000 bytes; below that, plain HASHBYTES gives the same result.
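If you deploy the assembly by hand rather than through Visual Studio/SSDT, the registration might look roughly like this (the assembly name, DLL path, and class name are assumptions for illustration):

CREATE ASSEMBLY BigHashBytesAssembly
    FROM 'C:\clr\BigHashBytes.dll'
    WITH PERMISSION_SET = SAFE;
GO

-- Declaring @data as NVARCHAR(MAX) avoids the silent truncation
-- described in the comments below.
CREATE FUNCTION dbo.BigHashBytes
(
    @algorithm NVARCHAR(50),
    @data      NVARCHAR(MAX)
)
RETURNS VARBINARY(MAX)
AS EXTERNAL NAME BigHashBytesAssembly.UserDefinedFunctions.BigHashBytes;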

– Paul Tyng
    Careful with CLR SP parameters being silently truncated to 8000 bytes - I had to tag the parameter with `[SqlFacet(MaxSize = -1)]` otherwise bytes after the 8000th would simply be ignored! Took me a while to get that one! – randomdude Feb 06 '13 at 12:42
  • -1 for misleading readers (unintentionally, of course) into using UTF-8. SQL Server (and Windows in general) uses UTF-16. Hence this will appear to work for many people who are currently only using Code Points 0 - 127, or possibly up to 256. But using UTF-8 leads to problems such as this Question: [SQL CLR function based on .net ComputeHash is not working with Cyrrilic](http://stackoverflow.com/a/35273859/577765) – Solomon Rutzky Feb 08 '16 at 16:03
  • @randomdude Your experience was due to how older versions of SSDT generated the T-SQL wrapper objects. The default used to be `NVARCHAR(4000)` for `SqlString` and `NVARCHAR(MAX)` for `SqlChars`, but since around the time of Visual Studio 2013 the default is `NVARCHAR(MAX)` for both. Still, it is always better to be explicit and use `[SqlFacet()]` as you mention, though folks using the newer SSDT versions wouldn't run into that. Also, SQLCLR only supports `NVARCHAR`, so it would actually truncate at 4000 chars :-). – Solomon Rutzky Feb 08 '16 at 16:07

You could hash 8k (or 4k or 2k) chunks of the input and then either concatenate those hashes or hash them into a new hash value; a sketch of the latter follows. The catch is that anything comparing hashes created outside of SQL Server (an external .NET app, for example) has to reproduce the same chunking scheme.
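For example, a minimal T-SQL sketch of the hash-of-hashes variant (the chunk size and variable names are illustrative, and the result is deliberately not equal to a single hash over the whole input):

DECLARE @data   NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'x'), 100000);
DECLARE @pos    INT = 1;
DECLARE @hashes VARBINARY(MAX) = 0x;

-- 4000 NVARCHAR characters = 8000 bytes, exactly the HASHBYTES input limit
WHILE @pos <= LEN(@data)
BEGIN
    SET @hashes = @hashes + HASHBYTES('SHA1', SUBSTRING(@data, @pos, 4000));
    SET @pos = @pos + 4000;
END;

-- Hash the concatenated chunk hashes into one value. With very many chunks,
-- @hashes itself can grow past 8000 bytes and would need the same treatment.
SELECT HASHBYTES('SHA1', @hashes) AS hash_of_hashes;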

Another option: Lean on SQL Server's CLR integration and perform the hashing in a .NET assembly.

– Paul Sasik
  • thanks Paul, it's a really good solution for internal use. but in this case it has to be totally compliant with sha1 – SDReyes Oct 14 '11 at 16:28
  • @SDReyes: Another idea added to my answer: Lean on SQL Server's CLR integration and perform the hashing in a .NET assembly. – Paul Sasik Oct 14 '11 at 16:30
  • the CLR integration link in this answer is broken. – David Oct 13 '20 at 19:29

Building on Paul's chunking idea, you could store the per-chunk hashes in an XML column, with each chunk's hash as a separate XML element; a rough sketch is below.
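For instance, a sketch using a recursive numbers CTE to split the string into 4000-character chunks (all names are illustrative; the ord attribute anticipates the index mapping Paul Sasik suggests in the comment below):

DECLARE @data NVARCHAR(MAX) = REPLICATE(CONVERT(NVARCHAR(MAX), N'x'), 100000);

;WITH chunks AS (
    SELECT n = 1
    UNION ALL
    SELECT n + 1 FROM chunks WHERE n * 4000 < LEN(@data)
)
SELECT n AS '@ord',
       -- style 2 renders the VARBINARY hash as a hex string (SHA-1 = 40 chars)
       CONVERT(CHAR(40), HASHBYTES('SHA1',
               SUBSTRING(@data, (n - 1) * 4000 + 1, 4000)), 2) AS '@sha1'
FROM chunks
FOR XML PATH('chunk'), ROOT('hashes')
OPTION (MAXRECURSION 0);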

– Shan Plourde
  • +1 Good idea, though for data integrity's sake I would suggest a single hash value be stored (a hash of hashes.) With an XML structure you would need an extra attribute that maps a hash value to the index of the Nth chunk of data that was hashed. – Paul Sasik Oct 14 '11 at 16:08