
I'm using a .NET DLL to hash strings so that I can compare them later.

I've wrapped the call to that function inside a SQL Server CLR UDF and published it on the server.

Now, when I execute the function, the output is different from the one I get when running a Console Application.

Function signature is as follows:

[Microsoft.SqlServer.Server.SqlFunction]
public static SqlInt32 CalculateHash(SqlString input)
{
    string unwrappedValue = input.Value.Normalize();
    int hash = HashCalculator.Calculate(unwrappedValue);
    return new SqlInt32(hash);
}

As you can see, I'm unwrapping and normalizing the string before even calculating the hash. So, I would expect the results to be the same no matter where I'm calling that code from.
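For context, here is a minimal sketch (not part of the original code) of what the `Normalize()` call does: two different code-point sequences that render identically are folded into the same string (form C by default):

```csharp
using System;

class NormalizeDemo
{
    static void Main()
    {
        // "ü" as a single precomposed code point vs. "u" + combining diaeresis
        string composed = "\u00FC";
        string decomposed = "u\u0308";

        Console.WriteLine(composed == decomposed);                         // False: different code points
        Console.WriteLine(composed.Normalize() == decomposed.Normalize()); // True: both become form C
    }
}
```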

Given the string `Test 123`, I'm getting:

-387939562 - When running from a Console Application
137570918  - When calling from SQL Server

SQL Server UDFs do not allow Debug.Print (or similar) calls.

And for some reason, Visual Studio won't stop at the *.cs breakpoints when debugging the UDF (but that's a different problem I'm still trying to figure out).

My question is: why would the same function give two different results? What is it about the UDF that could be causing this? I even tried changing the database's collation, but it does not affect the function's result.

Edit: I managed to step into the code while running the SQL query and found out that String.GetHashCode() returns a different value when running inside SQL Server. In any case, I'd assumed any charset-like problem would go away since I normalize the string before using it.

Edit 2: Given that GetHashCode seemed to be the problem, I've checked the code for it here:

How is GetHashCode() of C# string implemented?

And found out that the .NET implementation differs between 32-bit and 64-bit processes. When I set my Console app to run in 64-bit mode, the output matched the result I got from SQL Server.
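A quick way to observe this (a sketch, not from the original code): on .NET Framework the string hash depends on process bitness, and on .NET Core/.NET 5+ it is additionally randomized per process, so no specific value can be relied on across processes:

```csharp
using System;

class HashCodeDemo
{
    static void Main()
    {
        // On .NET Framework the value differs between 32-bit and 64-bit processes;
        // on .NET Core / .NET 5+ it also changes on every run (hash randomization).
        Console.WriteLine(Environment.Is64BitProcess);
        Console.WriteLine("Test 123".GetHashCode());

        // The only guarantee: stable within a single process for the same string.
        Console.WriteLine("Test 123".GetHashCode() == "Test 123".GetHashCode()); // True
    }
}
```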

  • Are you using `varchar` or `nvarchar` in SQL? – Matthew Whited Oct 14 '16 at 18:33
  • Running the SQL query as either `select myFunction('Test 123')` or `select myFunction(N'Test 123')` outputs the same result. I believe this is varchar, right? In any case, I assumed I would not have problems with charsets since I'm normalizing the string before usage. Also, see my edit, as I seem to have pinpointed the problem. Thanks. – Rodrigo Lira Oct 14 '16 at 18:45
  • What does `.Normalize` do? – Lasse V. Karlsen Oct 14 '16 at 19:14
  • You can't rely on `GetHashCode` being stable other than in the current process. – Lasse V. Karlsen Oct 14 '16 at 19:15
  • @MatthewWhited There really is no `VARCHAR` when using SQLCLR. Only NVARCHAR is allowed in the API. VARCHAR data will be implicitly converted to NVARCHAR on the way in. And a string of "Test 123" would be the same in both 8-bit Extended ASCII and Unicode anyway. – Solomon Rutzky Oct 14 '16 at 19:19
  • @LasseV.Karlsen `Normalize()`, depending on which of the 4 normalization forms is chosen, is generally used to make the following two sequences the same: `SELECT NCHAR(0x00FC) AS [ü], N'u' + NCHAR(0x0308) AS [u + combining diaeresis];`. Normalization can either break the single character into the other two, or, in the other direction, will combine the 2 code points into the singular form. Also, regarding `GetHashCode`, good point! It could even return a different value between console app and SQL Server if SQL Server is 2005, 2008, or 2008 R2, due to being tied to CLR 2.0. – Solomon Rutzky Oct 14 '16 at 19:25
  • UTF8 and Unicode/UTF16 are not the same thing. ASCII and UTF8 would look the same as long as you aren't using extended characters. NVARCHAR is double-byte UTF16. It is also possible that a BOM or some other unprinted characters that you wouldn't notice could affect the calculation. – Matthew Whited Oct 14 '16 at 21:16
  • @MatthewWhited Yes, UTF8 and UTF16 are different. Yes, UTF8 is the same as ASCII and the lower-half of Extended ASCII. `NVARCHAR` is UTF-16 Little Endian (there's no plain UTF-16, it's either Big Endian or Little Endian; unspecified is implied Big Endian). Still, nobody mentioned UTF-8 or Byte Order Marks. And neither of those are relevant here since .NET strings are only UTF-16 LE. If you want UTF-8, UTF-32, UTF-16 Big Endian, ASCII, or any 8-bit Extended ASCII Code Page, then you will be dealing with a `byte[]`. `NVARCHAR` is only UTF-16 LE. Which is why none of this was the issue. – Solomon Rutzky Oct 15 '16 at 06:17
  • Strings can contain unprinted characters, as I noted. Without seeing what his hash calculation does, or the actual strings in the debugger, you have no idea what is going on. – Matthew Whited Oct 15 '16 at 13:11

1 Answer


Once I managed to debug the UDF running on SQL Server, I was finally able to figure out that the algorithm I'm using to hash a given string relies on .NET's GetHashCode method.

And at least for the String class, as per this question, the original implementation gives different results for 32 and 64 bit platforms.

So, when I ran the Console application to test the functionality, it ended up running as a 32-bit process. But the installed SQL Server is 64-bit, which forces the other implementation of GetHashCode to run.

I was able to replicate SQL Server's results by forcing the Console application to run as a 64-bit application.

Then it was only a matter of tweaking the parameters so that the 32-bit and 64-bit platforms give similar (but not equal) results.
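For anyone hitting the same issue: an alternative to tuning GetHashCode-based output per platform is a hash with a fixed specification, which gives identical results on any bitness or runtime. Below is a minimal sketch using 32-bit FNV-1a over the UTF-8 bytes of the normalized string (the `Fnv1a32` name is illustrative, not from the original code):

```csharp
using System;
using System.Text;

static class StableHash
{
    // 32-bit FNV-1a over the UTF-8 bytes of the (normalized) input.
    // Unlike String.GetHashCode, the result is defined by the algorithm itself,
    // so it is the same in a 32-bit console app, a 64-bit SQL Server, or any runtime.
    public static int Fnv1a32(string input)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(input.Normalize()))
        {
            hash ^= b;
            hash = unchecked(hash * prime); // wrap-around overflow is intended
        }
        return unchecked((int)hash);
    }

    static void Main()
    {
        Console.WriteLine(Fnv1a32("Test 123"));
    }
}
```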
