2

I'm storing unique user-agents in a MySQL MyISAM table so when I have to look if it exists in the table, I check the md5 hash that is stored next to the TEXT field.

User-Agents 
{
    id         - INT
    user-agent - TEXT
    hash       - VARCHAR(32) // md5
}

There is any way to do the same but using a 32-bit integer and not a text hash? Maybe the md5 in raw format will be faster? That will requiere a binary search.

[EDIT]

MySQL don't handle hash searches for complete case-sensitive strings?

Wiliam
  • 3,714
  • 7
  • 36
  • 56
  • 1
    [MySQL DOES handle case sensitive searches](http://dev.mysql.com/doc/refman/5.0/en/case-sensitivity.html), if either your table (or column) is defined as case sensitive or you specifiy collation in your query (the first option will be faster then the second of course). – wimvds Aug 24 '10 at 08:18
  • "so when I have to look if it exists in the table, I check the md5 hash that is stored next to the TEXT field." - why? Are you sure (i.e. have you tested) that computing the hash and then search for it is faster/better than simply searching for the string itself (with an appropriate index)? Or is there another reason why you have to search for the hash instead of the plain text? – VolkerK Aug 24 '10 at 08:35
  • @VolkerK: The string is a TEXT field. There are one million reasons to use the hash. For example, each time I have to check if the string exists I have to escape the string, send it through the network and MySQL have to generate a index record for a big text field each time it's inserted. For example. – Wiliam Aug 24 '10 at 08:54
  • "a big text field" - ah, you didn't say that. Are we talking megabytes here? You can avoid escaping with parametrized, server-side prepared statements. And collisions _can_ happen, more so if you switch to an algorithm that produces only 32bit results. So you'd have to test the actual contents anyway to be on the safe side. And computing a hash isn't exactly free, too. It maybe quite fast but it's not done in 0-time. – VolkerK Aug 24 '10 at 09:11

9 Answers9

8

Store the UNHEX(MD5($value)) in a BINARY(16).

Cosmin
  • 21,216
  • 5
  • 45
  • 60
druid
  • 81
  • 1
  • 1
5

You could do this instead:

User-Agents 
{
    id         - INT
    user-agent - TEXT
    hash       - UNSIGNED INT (CRC32, indexed)
}


$crc32 = sprintf("%u", crc32($user_agent));

SELECT * FROM user_agents WHERE hash=$crc32 AND user_agent='$user_agent';

It's unlikely that you'll get collisions with crc32 for this kind of data.

To guarantee that collisions will not cause problems, add a secondary search parameter. MySQL will be able to use the index to quickly find the record. Then it can do a simple string search to guarantee that match is correct.

PS: The sprintf() is there to work around signed 32-bit integers. Should be unnecessary on 64-bit systems.

Matthew
  • 47,584
  • 11
  • 86
  • 98
  • I like this solution, but another check column will requiere a md5 16-binary-bytes hash right? Only to use when 2 "crc32-hash" records have the same value. – Wiliam Aug 24 '10 at 08:29
  • There is no need for a second index or a md5 with this approach. MySQL can only use one index per query. It will use the crc index to find any possible records. After it has found them (usually 0 or 1 record), it will do a string search on the user agent to make sure it isn't a false positive (Collision). – Matthew Aug 24 '10 at 08:40
4

Let MySQL do the hard work for you. Use a CHAR column and create an index on that column. You could convert and store the hash as an integer, but there's absolutely no benefit, and it may actually cause problems.

cbednarski
  • 11,718
  • 4
  • 26
  • 33
  • 3
    Why varchar, not char for fixed-width string? – Māris Kiseļovs Aug 24 '10 at 07:38
  • @Māris Kiseļovs Updated--Thanks for the tip. – cbednarski Aug 24 '10 at 07:51
  • 1
    Fixed CHAR better than VARCHAR. But, 32 bytes CHAR is bigger than 16 bytes BINARY, so will have bigger indexes values and table will be bigger, so performance will be in some cases 200% slower for that index right? I'm working with hypothetic 50M rows tables. – Wiliam Aug 24 '10 at 07:57
3

try MurmurHash. Its a fast hashing algo thats been translated to multiple languages. It takes your input and translates it into a 32/64 bit integer hash.

ebt
  • 1,358
  • 1
  • 13
  • 18
  • That sounds good, but a php implementation will not be as fast as the native php md5 right? And the hash hasn't been test enough yet I think. – Wiliam Aug 24 '10 at 08:41
  • 2
    I've use it in production to create unique (UUID4 unique) hashes with no collisions at 100M+ records so far. Speed depends on implementation, C version blazes, I can't speak for PHP. – ebt Aug 24 '10 at 09:09
2

You can't store an MD5 hash in a 32-bit int: it simply won't fit. (It's 32 characters when written in hex, but it's 128-bits of data)

You could look at MySQL's BINARY and VARBINARY types. See http://dev.mysql.com/doc/refman/5.1/en/binary-varbinary.html. These types store binary data. In your case, BINARY(16) or VARBINARY(16), but since MD5 hashes are always 16 bytes, the latter seems a bit pointless.

Thanatos
  • 42,585
  • 14
  • 91
  • 146
  • BINARY(16) better and I know I can't store a md5 in a 32-bit integer, what I was asking is for another solution that fits in a 32-bit integer or in 2 fields. Thanks – Wiliam Aug 24 '10 at 08:01
1

You can store MD5 hash in char(32) which is a bit faster than varchar(32). It's also possible to make two BIGINT fields and keep first half of md5 hash in first field and second part in second field.

Māris Kiseļovs
  • 16,957
  • 5
  • 41
  • 48
0

Are you REALLY sure the hashes are only 32-bit? MD5 is 128-bit. Cropping the hash to first 4 or 8 bytes would greatly increase risk of collisions.

Matěj Zábský
  • 16,909
  • 15
  • 69
  • 114
0

If your field hash is always an MD5 value generated by PHP, then you can safely set it to CHAR(32). This should not impact the response time to your queries, unless you plan to have millions+ of rows, or even worst! JOIN other tables with this field. The bottom line is that fixed width column is better than variable ones, so if you can optimize do it.

Regarding changing MD5 into int values, see this question; the conclusion to this is that if you really want to change your MD5 into a 128-bit int value, you might as well use a random number instead of an MD5!

Community
  • 1
  • 1
Yanick Rochon
  • 51,409
  • 25
  • 133
  • 214
  • I'm working with hypothetic tables of 50M records. Thinking small is an error. – Wiliam Aug 24 '10 at 07:58
  • then use a random generated number instead of an MD5. With 32-bit integers, you have plenty of range to play with, even with 500M rows. – Yanick Rochon Aug 24 '10 at 08:06
  • It can't be random, it must be the hash of the string field. So If I need to find the text in the string field I have only to search the hash. – Wiliam Aug 24 '10 at 08:30
  • you should put that field as UNIQUE then to avoid any possible collision (even for slim chances) and since this field will apparently only be used for searching purposes, then CHAR(32) will be your best choice. You could even go for SHA-1, which generates a 40 characters long string and would still be good in response time since indexed strings are stored in binary trees... – Yanick Rochon Aug 24 '10 at 14:04
0

Have you tried creating a BINARY(16) field, and storing the result of md5($plaintext, true); in it? That might work, make sure you index that field as well.

Because trying to fit a 128-bit value in 32 bits doesn't make any sense...

wimvds
  • 12,790
  • 2
  • 41
  • 42