SQL Server - Compare large number of MD5-Hashed Values

Question

Suppose I have a table in SQL Server storing customer emails (a few million records) - For simplicity's sake, looking somewhat as follows:

CREATE TABLE [Emails]
(
    [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [email] [nvarchar](1000) NOT NULL
)

And I have clients that send me lists, each millions of records long, of their customers' emails, but all MD5-Hash encrypted, so the list would simply look as follows:

0x3B46E0E53842A74172BA678974E93BBB
0xACAC5843E184C85AA6FF641AAB0AA644
0xD3C7BA16E02BE75142761894E8E4A125
...

And I have to come up with a fast way to see what emails from their list exist in my table.

Based upon some answers I have seen online / here, I came up with the following logic to do this:

I created an indexed view of my Emails table with the MD5-Hash column as the index:

CREATE VIEW dbo.vw_Emails
WITH SCHEMABINDING
AS
    SELECT 
        Id
         , email
         , CONVERT(VARBINARY(16), HASHBYTES('MD5', LOWER(email))) AS MD5
    FROM 
        dbo.Emails
GO

CREATE UNIQUE CLUSTERED INDEX Idx_vw_Emails ON vw_Emails (MD5)
GO

I created a stored procedure that will BulkImport the list given, convert it into a temporary table, join it against my view and return any matched rows as follows:

CREATE PROCEDURE Import_ReturnMatches
(
    @PathToCSVFile VARCHAR(8000)
)

AS

DECLARE @fieldsep CHAR(1) = ',';
DECLARE @recordsep CHAR(1) = CHAR(10);

DECLARE @Emails TABLE 
(
    MD5 VARCHAR(MAX) NOT NULL
);

DECLARE @sql VARCHAR(8000) = 
    'CREATE TABLE #tmp 
    (
          MD5 varchar(max) NOT NULL
    );

    BULK INSERT #tmp
    FROM ''' + @PathToCSVFile + '''
    WITH (FIRSTROW = 1, FIELDTERMINATOR = ''' + @fieldsep + ''', ROWTERMINATOR = ''' + @recordsep + ''');

    SELECT *
    FROM #tmp';

INSERT INTO @Emails
EXEC (@sql);

SELECT 
    r.*
FROM 
    @Emails l
    JOIN vw_Email_Dim r 
    ON l.MD5 = r.MD5

As you can see, I set the column type for my import as VARCHAR(MAX), but that was only because nothing else really worked... That's where I'm stuck. It seems to always be returning an empty set even though I've placed records in my file that should match.

My questions are:

What am I doing wrong / how can I fix this?
Am I using the right data-types for my storage / indexing / import?
Is this just an overall bad idea and is there a better way to accomplish what I'm looking to do?

If at all possible I would add another column to your table which stores the MD5 hash of your email. It would certainly make the join faster. Also the case of the letters in the email may be causing grief. — thomas, Apr 08 '18 at 20:30
Thanks, @thomas, I most probably will end up doing that, but this is still a test for my company, so they're not letting me change production tables until I prove this can work. It does make sense, though. — John Bustos, Apr 08 '18 at 20:31
@MartinSmith, I'm seeing that and that's much of the crux of my question - How can I accomplish this? — John Bustos, Apr 08 '18 at 20:35
You need to use `CONVERT` with a style parameter to convert from varchar containing a hex string to varbinary or vice versa — Martin Smith, Apr 08 '18 at 20:37
The very fact that you are doing this, shouldn’t you be trying to convince your boss/vendor why NOT to use md5. Think of all the other people that could quite easily do this should they get to your data... — DimUser, Apr 08 '18 at 21:34
@DimUser, absolutely not my call there - I'm just the hired hand :) — John Bustos, Apr 08 '18 at 22:40

score 0 · Answer 1 · answered Apr 08 '18 at 20:24

Your problem is likely this value:

LOWER(email)

If you aren't sure what case or encoding (Windows 1252, UTF8, UTF16, UTF16LE?) was used to generate the MD5 hash from the email at your source, then you are looking at needing to test all the combinations for a matching hash value. Consider where we change the LOWER to UPPER - a completely different MD5 hash value is generated:

You will either need to control how the MD5 hash is generated at the source or add metadata (another field) to the import to describe how the input is cased and encoded.

I thought so too, but even when I put the hash generated by my view DIRECTLY into my file, I'm returning empty recordsets... — John Bustos, Apr 08 '18 at 20:27

score 0 · Accepted Answer · answered Apr 08 '18 at 20:36

0

Check this answer. You need to compare varchar to varchar - not varbinary I think.

Generate MD5 hash string with T-SQL

answered Apr 08 '18 at 20:36

thomas

2,592
1
30
44

I tried this for my join: `@Emails l JOIN vw_Emails r ON l.MD5 = CONVERT(VARCHAR(MAX), r.MD5, 2)`, but still no luck... Am I doing something wrong? – John Bustos Apr 08 '18 at 20:38
1

Can you change the view to convert to varchar(32)? – thomas Apr 08 '18 at 20:39
You need to show us some example data where it fails and you expect it to succeed ( on both sides). I.e. provide an MVCE – Martin Smith Apr 08 '18 at 20:41
1

@john You need style parameter 1 to preserve the `0x` – Martin Smith Apr 08 '18 at 20:48
I just tried this - **IT WORKED** - Between the storage column issue and my insane stupidity at the `ROWTERMINATOR` - I needed to make it `\n`! – John Bustos Apr 08 '18 at 20:54

SQL Server - Compare large number of MD5-Hashed Values

2 Answers2