1

I am just designing a database with email addresses (for unsubscribing option). Because IMHO indexing a larger varchar field is not a best idea so I am thinking about a creating int hashcode field, filling it with hashcodes of email addresses and creating index on this field instead of email field directly. This way the lookup in database would look like:

SELECT TOP 1 NULL FROM tbUnsub WHERE emailhash=-5421215 AND emailaddress='just.a@sample.com'

The question is, if it will need less data (because of missing index on larger varchar field) and if it will work faster because of searching in int field.

Thank you for your help in advance!

HoGo
  • 147
  • 2
  • 17
  • If you can tolerate collisions, then it might be faster. I sugegst you benchmark in your particular circumstances. – Mitch Wheat Nov 28 '14 at 08:43
  • As I see there will be no collisions, "same hash on different string problem" will be solved with looking after exact email address as next step. – HoGo Nov 28 '14 at 08:47
  • How are you guaranteeing no collisions??? – Mitch Wheat Nov 28 '14 at 08:59
  • Maybe we are talking about different things, could you please explain a bit what collisions do you have on mind? – HoGo Nov 28 '14 at 09:03

2 Answers2

0

So, after a short tests:

Data storage - in case of almost 1000000 records in database:

Table without hash field: 31.641MB data, 36.742 MB index - together 68,383 MB Table with hash field: 35.367 MB data, 16.859 MB index - together 52,226 MB

So even if you have to store data for one more field, it will need less storage space because of less space necessary for index on int field.

Performance on searching in database:

There is a big difference, if you are looking mostly for existing or not existing records:

0 existing, 2000 not existing: 18.678 s without hash, 6.620 s with hash

1000 existing, 1000 not existing: 10.815 s without hash, 5.054 s with hash

2000 existing, 0 not existing: 1.782 s without hash, 2.909 s with hash

So if you are looking for mostly existing record, your best option is the built in one, do not spend time on playing with hash. It can be an option if you will be looking mostly for data not in database.

Just an interesting thing: it is much more faster to look only after hash in sql directly in database, and check if email address is equal in local application (times: 2.286, 1.972, 2.070)

HoGo
  • 147
  • 2
  • 17
0

In general it's not a good idea, as a hash isn't guaranteed to be unique. You therefore run the risk of selecting multiple entries, or the wrong entry if you query solely by hash value. See this question for a short explanation. You have a 50% likelihood to duplicate a key at around 54,000 records (for integer keys), which is perhaps a surprisingly low number.

Community
  • 1
  • 1
MonkeyPushButton
  • 1,077
  • 10
  • 19
  • That record count seems to be too low... Of course I know about problems with hash duplicity, this is the reason I added original varchar email address to where condition - or doing additional check localy, not on sql server. – HoGo Nov 28 '14 at 16:28
  • Well, beside of solution, just checked it, there are 569 double hashes on 961534 total count, so I have one double on each of 1690 record. Strange. – HoGo Nov 28 '14 at 16:34
  • 1
    See the [Birthday Problem](http://en.wikipedia.org/wiki/Birthday_problem) for an explanation of why you get more collisions than you might expect. – MonkeyPushButton Dec 01 '14 at 17:46