3

Inspired by the following two answers on stackoverflow I tried to implement a table with the goal to store User-Agent strings in it:

Here's my table structure:

CREATE TABLE IF NOT EXISTS ua_strings (
    ua_id INTEGER PRIMARY KEY AUTO_INCREMENT, 
    ua_hash BINARY(16), 
    ua TEXT, 
    UNIQUE KEY ua_hash (ua_hash)
);

I'd like to achieve the following:

  • Input: User-Agent string that should be inserted in the table only if it doesn't exist yet

  • Output: ua_id

So far I've come up with this solution:

INSERT IGNORE INTO ua_strings (ua_hash, ua) VALUES (UNHEX(MD5('test')), 'test');
SELECT ua_id FROM ua_strings WHERE ua_hash = UNHEX(MD5('test'));
  • Is it possible to make one query out of those two queries?
  • How can I improve my table structure or queries in terms of speed and elegance?
Community
  • 1
  • 1
  • Don't use MD5 even in an example, [it's compromised](https://en.wikipedia.org/wiki/MD5#Security).. Use [SHA1](https://dev.mysql.com/doc/refman/5.6/en/encryption-functions.html#function_sha1) or better. – Schwern Mar 17 '15 at 06:03
  • If I understand you correctly, given a User-Agent string you want a ua_id whether or not that User-Agent is already in the table? – Schwern Mar 17 '15 at 06:07
  • @Schwern. I know that MD5 is compromised. Collisions shouldn't be a problem. I don't need a cryptographically secure hash function. – renard volant Mar 17 '15 at 11:15
  • That's like saying "even though I have locks better in every way, I'm going to use this crappy, broken, old lock because I don't think this door is important". You're not a security expert (neither am I) and cannot predict which part of a system is vulnerable. Just put decent locks on everything. It's not like it's going to practically change the performance of anything to use MD5 vs SHA1. – Schwern Mar 17 '15 at 23:06
  • 4
    I'm guessing he's hashing it for faster lookups, not security, in which case a faster hashing algorithm that's not crypto-secure would be preferable. – Mason Stedman Aug 26 '18 at 20:58

1 Answers1

2

The most important thing is to get rid of the INSERT IGNORE. I discovered this will increment the primary key even if it fails. You can burn through your 4 billion keys very quickly this way. Do the SELECT first, it's going to be the most common case anyway.

My first thought is to put the logic into a database function. That gives you all the benefits of encapsulation. Then you can change how this works later.

My second is to get rid of that hash. It's effectively taking the place of an index on ua. Since you only need equivalence checks to be performant, a hash index would be ideal, but those are not supported on most MySQL table formats.

In its place I would use an index on the first 255 bytes of the User-Agent, that should be plenty for MySQL to do its job. This also gives you the benefit of a full index should you need to do something more than simple gets.

CREATE TABLE IF NOT EXISTS ua_strings (
    ua_id INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT, 
    ua TEXT,
    KEY(ua(255))
);

The function would look like this (note, I am not the best at writing MySQL functions).

DELIMITER //
CREATE FUNCTION get_ua_id (ua_string TEXT)
RETURNS INTEGER
BEGIN
    DECLARE ret INTEGER;

    SELECT ua_id INTO ret FROM ua_strings WHERE ua = ua_string;

    /* It's not in the table, put it in the table */
    CASE WHEN ROW_COUNT() = 0 THEN
        INSERT INTO ua_strings (ua) VALUES (ua_string);
        SELECT LAST_INSERT_ID() INTO ret;
    ELSE BEGIN END;
    END CASE;

    RETURN ret;
END//
DELIMITER ;

A function with the hash would look very similar. Hide the implementation details in the function and benchmark the two.

And really don't use MD5. It's not going to affect performance to use SHA1, you can spare the extra 4 bytes per entry, and it will avoid hidden problems down the road. Using MD5 is like saying "Even though there's locks better in every way, I'll use this crappy lock because I don't think this door is important right now". You're not a security expert (neither am I) and don't know what parts are important and what aren't. Just put decent locks on everything. If SHA1 turns out to be some huge performance problem, you can always change it thanks to the encapsulation of the function.

No matter what the benchmarking results, I am willing to bet profiling will reveal your choice has no effect on the performance of whatever system this is a part of. Go with the simpler and more flexible indexed option, optimize it if it turns into a problem later.

Schwern
  • 153,029
  • 25
  • 195
  • 336