Suppose I have a table in SQL Server storing customer emails (a few million records). For simplicity's sake, it looks something like this:
```sql
CREATE TABLE [Emails]
(
    [Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [email] [nvarchar](1000) NOT NULL
)
```
And I have clients that send me lists, each millions of records long, of their customers' emails, but MD5-hashed, so each list simply looks like this:
0x3B46E0E53842A74172BA678974E93BBB
0xACAC5843E184C85AA6FF641AAB0AA644
0xD3C7BA16E02BE75142761894E8E4A125
...
And I have to come up with a fast way to see what emails from their list exist in my table.
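To make the matching idea concrete, this is roughly what I assume the clients do to produce each line of the file (sketched in Python; the example address and the lowercase/UTF-8 conventions are my assumptions, not something the clients have confirmed):

```python
import hashlib

# Hypothetical customer address; I assume the clients normalize to
# lowercase before hashing, which is why my view hashes LOWER(email).
email = "Alice@Example.com"

# MD5 of the lowercased, UTF-8-encoded address, formatted like the list entries.
digest = hashlib.md5(email.lower().encode("utf-8")).hexdigest().upper()
print("0x" + digest)
```

If the clients normalize differently (case, whitespace, encoding), the hashes will never match mine, so this assumption matters.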
Based upon some answers I have seen online / here, I came up with the following logic to do this:
I created an indexed view of my `Emails` table with the MD5 hash column as the index:

```sql
CREATE VIEW dbo.vw_Emails WITH SCHEMABINDING AS
SELECT
    Id,
    email,
    CONVERT(VARBINARY(16), HASHBYTES('MD5', LOWER(email))) AS MD5
FROM dbo.Emails
GO

CREATE UNIQUE CLUSTERED INDEX Idx_vw_Emails ON vw_Emails (MD5)
GO
```
I created a stored procedure that bulk-imports the given list into a temporary table, joins it against my view, and returns any matched rows:

```sql
CREATE PROCEDURE Import_ReturnMatches
(
    @PathToCSVFile VARCHAR(8000)
)
AS
DECLARE @fieldsep CHAR(1) = ',';
DECLARE @recordsep CHAR(1) = CHAR(10);

DECLARE @Emails TABLE
(
    MD5 VARCHAR(MAX) NOT NULL
);

DECLARE @sql VARCHAR(8000) =
    'CREATE TABLE #tmp ( MD5 varchar(max) NOT NULL );
     BULK INSERT #tmp FROM ''' + @PathToCSVFile + '''
     WITH (FIRSTROW = 1, FIELDTERMINATOR = ''' + @fieldsep + ''', ROWTERMINATOR = ''' + @recordsep + ''');
     SELECT * FROM #tmp';

INSERT INTO @Emails EXEC (@sql);

SELECT r.*
FROM @Emails l
JOIN vw_Emails r ON l.MD5 = r.MD5
```
As you can see, I set the column type for my import to VARCHAR(MAX), but only because nothing else really worked. That's where I'm stuck: the procedure always returns an empty set, even though I've placed records in my file that should match.
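To illustrate the kind of type mismatch I suspect is involved, here is a small Python sketch (with a made-up address) showing that a hex *string* like the lines in my file is not the same value as the raw 16-byte digest, which would explain a join that matches nothing:

```python
import hashlib

digest = hashlib.md5(b"alice@example.com").digest()   # 16 raw bytes (like VARBINARY(16))
hex_text = "0x" + digest.hex().upper()                # 34-character text (like my VARCHAR column)

# The text form and the binary form are different values, so comparing
# them directly never matches...
assert hex_text.encode() != digest

# ...but converting the hex text back to bytes does match.
assert bytes.fromhex(hex_text[2:]) == digest
```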
My questions are:
- What am I doing wrong / how can I fix this?
- Am I using the right data types for my storage / indexing / import?
- Is this just an overall bad idea and is there a better way to accomplish what I'm looking to do?