I am trying to run a one-way hash for a string column in USQL. Is there a way to do this inline? Most of the C# samples found online require multiple lines of code - which is tricky in USQL without a code-behind or compiled C# assembly.
Asked
Active
Viewed 733 times
0
-
I hope this isn't for passwords. – tadman May 29 '18 at 18:41
-
1Nope. Not for passwords! :) – aaronsteers May 29 '18 at 19:26
2 Answers
2
Option 1 (Inline formula):
The code below can be used to compile a SHA256 or MD5 on any string, and runs without any special dependencies and without needing a code-behind file.
CREATE TABLE master.dbo.Test_MyEmail_Hashes AS
SELECT
cust.CustEmailAddr AS Email
, String.Concat(System.Security.Cryptography.SHA256.Create()
.ComputeHash(Encoding.UTF8.GetBytes(
cust.CustEmailAddr))
.Select(item => item.ToString("x2")))
AS Email_SHA2
, String.Concat(System.Security.Cryptography.MD5.Create()
.ComputeHash(Encoding.UTF8.GetBytes(
cust.CustEmailAddr))
.Select(item => item.ToString("x2")))
AS Email_MD5
FROM master.dbo.Customers AS cust
;
Option 2 (using Lambda functions): (UPDATED)
Thanks to @MichaelRys for the pointer that USQL now supports Lambda functions and can be cleaned up as in the below:
// Generic get_hash() function
DECLARE @get_hash Func<string,System.Security.Cryptography.HashAlgorithm,string> =
(raw_value, hasher) => String.Concat(hasher.ComputeHash(Encoding.UTF8.GetBytes(raw_value)));
// Short-hand functions for MD5 and SHA256:
DECLARE @md5 = System.Security.Cryptography.MD5.Create();
DECLARE @get_md5 Func<string,string> =
(raw_value) => @get_hash(raw_value, @md5);
DECLARE @sha256 = System.Security.Cryptography.SHA256.Create();
DECLARE @get_sha256 Func<string,string> =
(raw_value) => @get_hash(raw_value, @sha256);
// Core query:
CREATE TABLE master.dbo.Test_MyEmail_Hashes AS
SELECT
cust.CustEmailAddr AS Email
, @get_sha256(cust.CustEmailAddr) AS Email_SHA2
, @get_md5(cust.CustEmailAddr) AS Email_MD5
FROM master.dbo.Customers AS cust

aaronsteers
- 2,277
- 2
- 21
- 38
-
Related: https://stackoverflow.com/questions/16999361/obtain-sha-256-string-of-a-string/50591182#50591182 – aaronsteers May 29 '18 at 19:26
1
Actually I suggest you use the recently added "named lambdas" (Func<> typed variables) to use the multi-line C# samples. An example is here: https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2018/2018_Spring/USQL_Release_Notes_2018_Spring.md#u-sql-adds-c-func-typed-variables-in-declare-statements-named-lambdas

Michael Rys
- 6,684
- 15
- 23
-
1
-
Cool! you can also package these variables into a U-SQL package for reuse and import them when needed. – Michael Rys May 29 '18 at 19:27
-
Saw that, but haven't checked it out yet. Will be interested to try it. – aaronsteers May 29 '18 at 19:27
-
Performance of this seems fine for my use case but I'm curious if there's any way to run "MD5.Create()" outside of the context of the inline function so that the MD5 object doesn't have to be recreated on each row? Or would USQL locate this redundancy and optimize automatically? – aaronsteers May 29 '18 at 19:30
-
You could run it once per script and assign it to a variable and then pass it into the function variable as a parameter. We do not snapshot C# expressions inside lambdas. – Michael Rys May 29 '18 at 19:44
-
1Good to know. I've updated my answer to declare the objects at top of script. This seems to provide a mild performance boost. – aaronsteers May 29 '18 at 20:29