0

Suppose I have a table in SQL Server storing customer emails (a few million records) - For simplicity's sake, looking somewhat as follows:

CREATE TABLE [Emails]
(
    [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [email] [nvarchar](1000) NOT NULL
)

And I have clients that send me lists, each millions of records long, of their customers' emails, but all MD5-Hash encrypted, so the list would simply look as follows:

0x3B46E0E53842A74172BA678974E93BBB
0xACAC5843E184C85AA6FF641AAB0AA644
0xD3C7BA16E02BE75142761894E8E4A125
...

And I have to come up with a fast way to see what emails from their list exist in my table.

Based upon some answers I have seen online / here, I came up with the following logic to do this:

  1. I created an indexed view of my Emails table with the MD5-Hash column as the index:

    CREATE VIEW dbo.vw_Emails
    WITH SCHEMABINDING
    AS
        SELECT 
            Id
             , email
             , CONVERT(VARBINARY(16), HASHBYTES('MD5', LOWER(email))) AS MD5
        FROM 
            dbo.Emails
    GO
    
    CREATE UNIQUE CLUSTERED INDEX Idx_vw_Emails ON vw_Emails (MD5)
    GO
    
  2. I created a stored procedure that will BulkImport the list given, convert it into a temporary table, join it against my view and return any matched rows as follows:

    CREATE PROCEDURE Import_ReturnMatches
    (
        @PathToCSVFile VARCHAR(8000)
    )
    
    AS
    
    DECLARE @fieldsep CHAR(1) = ',';
    DECLARE @recordsep CHAR(1) = CHAR(10);
    
    DECLARE @Emails TABLE 
    (
        MD5 VARCHAR(MAX) NOT NULL
    );
    
    DECLARE @sql VARCHAR(8000) = 
        'CREATE TABLE #tmp 
        (
              MD5 varchar(max) NOT NULL
        );
    
        BULK INSERT #tmp
        FROM ''' + @PathToCSVFile + '''
        WITH (FIRSTROW = 1, FIELDTERMINATOR = ''' + @fieldsep + ''', ROWTERMINATOR = ''' + @recordsep + ''');
    
        SELECT *
        FROM #tmp';
    
    INSERT INTO @Emails
    EXEC (@sql);
    
    SELECT 
        r.*
    FROM 
        @Emails l
        JOIN vw_Email_Dim r 
        ON l.MD5 = r.MD5
    

As you can see, I set the column type for my import as VARCHAR(MAX), but that was only because nothing else really worked... That's where I'm stuck. It seems to always be returning an empty set even though I've placed records in my file that should match.

My questions are:

  1. What am I doing wrong / how can I fix this?
  2. Am I using the right data-types for my storage / indexing / import?
  3. Is this just an overall bad idea and is there a better way to accomplish what I'm looking to do?
John Bustos
  • 19,036
  • 17
  • 89
  • 151
  • 1
    If at all possible I would add another column to your table which stores the MD5 hash of your email. It would certainly make the join faster. Also the case of the letters in the email may be causing grief. – thomas Apr 08 '18 at 20:30
  • Thanks, @thomas, I most probably will end up doing that, but this is still a test for my company, so they're not letting me change production tables until I prove this can work. It does make sense, though. – John Bustos Apr 08 '18 at 20:31
  • Comparing binary and varchar won't do what you want. – Martin Smith Apr 08 '18 at 20:34
  • @MartinSmith, I'm seeing that and that's much of the crux of my question - How can I accomplish this? – John Bustos Apr 08 '18 at 20:35
  • 1
    You need to use `CONVERT` with a style parameter to convert from varchar containing a hex string to varbinary or vice versa – Martin Smith Apr 08 '18 at 20:37
  • The very fact that you are doing this, shouldn’t you be trying to convince your boss/vendor why NOT to use md5. Think of all the other people that could quite easily do this should they get to your data... – DimUser Apr 08 '18 at 21:34
  • @DimUser, absolutely not my call there - I'm just the hired hand :) – John Bustos Apr 08 '18 at 22:40

2 Answers2

0

Your problem is likely this value:

LOWER(email)

If you aren't sure what case or encoding (Windows 1252, UTF8, UTF16, UTF16LE?) was used to generate the MD5 hash from the email at your source, then you are looking at needing to test all the combinations for a matching hash value. Consider where we change the LOWER to UPPER - a completely different MD5 hash value is generated:

MD5 hashes of emails with different casing

You will either need to control how the MD5 hash is generated at the source or add metadata (another field) to the import to describe how the input is cased and encoded.

codekaizen
  • 26,990
  • 7
  • 84
  • 140
  • I thought so too, but even when I put the hash generated by my view DIRECTLY into my file, I'm returning empty recordsets... – John Bustos Apr 08 '18 at 20:27
0

Check this answer. You need to compare varchar to varchar - not varbinary I think.

Generate MD5 hash string with T-SQL

thomas
  • 2,592
  • 1
  • 30
  • 44