1

Let's assume we have this table

ID | MAIL
1  | no@gmail.com
2  | yes@gmail.com
3  | anything@gmail.com

you get the point... I want to query this table now and I receive as input the MD5 hash of the MAIL column values. So my queries would look like this:

SELECT * FROM TABLE 
WHERE HashBytes('MD5', MAIL ) = 'CE42C51D0B63DF9F616CCFB4C3FCF16C'

Now imagine this table is a very big table, thousand of thousand of records. My query becomes really slow. What I want to do now is to add an index, but using MD5. I read it can be done on PostgreSQL using something like this:

CREATE INDEX mail_md5 ON TABLE(DECODE(MD5(MAIL), 'HEX'))

Can I do something similar in SQL Server?

Note: I can't add another column with the hash values :(

Note 2: I know a HASH is an irreversible function, the example for the PostgreSQL index was copied from here

Community
  • 1
  • 1
cgajardo
  • 364
  • 1
  • 2
  • 17
  • 1
    Can you at least add new tables to the database so you could join pre-calculated hashes to the mail addresses? – Scoregraphic Aug 18 '15 at 15:07
  • That seems very difficult too. – cgajardo Aug 18 '15 at 15:14
  • My main question is if it's possible or not. If it's not, then I can start looking for new solutions. – cgajardo Aug 18 '15 at 15:15
  • As for the index question: no it's not possible and function evaluation for big tables is slow. So you could have persisted computed columns, ordinary columns or an additional table as possible workaround, as hashing cannot be reversed. – Scoregraphic Aug 18 '15 at 15:17
  • A **hash** is a **non-reversible** operation - you **cannot** "decode" a hash, that's one of the mainstays of why you should be using hashes (and not reversible encryption) for user passwords in the first place .... – marc_s Aug 18 '15 at 15:37
  • I know i cant decode a hash, what make you though of that? if for the example of PostgreSQL, I just copy what I found here: http://stackoverflow.com/questions/2632347/sql-indexing-on-varchar – cgajardo Aug 18 '15 at 16:01
  • I don't understand your example (`HashBytes('MD5', MAIL )`). You say you don't have the MAIL column value. Why are you using it in this query then? – usr Aug 18 '15 at 16:06
  • I did not know you could do that in type of index on PostgreSQL. I don't know of an equivalent in sql-server. Can you add an indexed view? – paparazzo Aug 18 '15 at 16:12

1 Answers1

1

I'm not sure what you are allowed to do schema-wise, but if you can modify the schema then SQL Server has something called Indexed Views, which are views that are stored in memory (vs computed on the fly).

You can query the view instead of the underlying table, and SQL Server will keep it all up-to-date for you. The key phrase is WITH SCHEMABINDING, which tells SQL Server to keep the computed fields in memory.

For example:

CREATE VIEW HashedAddresses
WITH SCHEMABINDING
AS
SELECT ID, MAIL, HASHBYTES('MD5',MAIL) as HashedMailMD5 from myschema.mytable;

Then you can create a unique clustered index on your hash field:

CREATE UNIQUE CLUSTERED INDEX IndexHashedAddresses ON HashedAddresses(HashedMailMD5);

after which this should be fast:

SELECT ID FROM HashedAddresses WHERE HashedMailMD5 = '0x121....'

Issue: if you get an MD5 collision, the index will fail. Not sure what to do about that...

Community
  • 1
  • 1
Robert Calhoun
  • 4,823
  • 1
  • 38
  • 34
  • Thank you! I can workaround the collision including other column ;) (my example was a simplification of the real scheme). Then again, thank you, this should work. – cgajardo Aug 18 '15 at 18:06
  • You're welcome. The Microsoft link above claims you don't even need to query against the view, because SQL Server will automatically make use of it as an optimization. I'd be curious to know whether that works for you. – Robert Calhoun Aug 18 '15 at 19:26