The most important thing is to get rid of the INSERT IGNORE
. I discovered this will increment the primary key even if it fails. You can burn through your 4 billion keys very quickly this way. Do the SELECT
first, it's going to be the most common case anyway.
My first thought is to put the logic into a database function. That gives you all the benefits of encapsulation. Then you can change how this works later.
My second is to get rid of that hash. It's effectively taking the place of an index on ua. Since you only need equivalence checks to be performant, a hash index would be ideal, but those are not supported on most MySQL table formats.
In its place I would use an index on the first 255 bytes of the User-Agent, that should be plenty for MySQL to do its job. This also gives you the benefit of a full index should you need to do something more than simple gets.
CREATE TABLE IF NOT EXISTS ua_strings (
ua_id INTEGER UNSIGNED PRIMARY KEY AUTO_INCREMENT,
ua TEXT,
KEY(ua(255))
);
The function would look like this (note, I am not the best at writing MySQL functions).
DELIMITER //
CREATE FUNCTION get_ua_id (ua_string TEXT)
RETURNS INTEGER
BEGIN
DECLARE ret INTEGER;
SELECT ua_id INTO ret FROM ua_strings WHERE ua = ua_string;
/* It's not in the table, put it in the table */
CASE WHEN ROW_COUNT() = 0 THEN
INSERT INTO ua_strings (ua) VALUES (ua_string);
SELECT LAST_INSERT_ID() INTO ret;
ELSE BEGIN END;
END CASE;
RETURN ret;
END//
DELIMITER ;
A function with the hash would look very similar. Hide the implementation details in the function and benchmark the two.
And really don't use MD5. It's not going to affect performance to use SHA1, you can spare the extra 4 bytes per entry, and it will avoid hidden problems down the road. Using MD5 is like saying "Even though there's locks better in every way, I'll use this crappy lock because I don't think this door is important right now". You're not a security expert (neither am I) and don't know what parts are important and what aren't. Just put decent locks on everything. If SHA1 turns out to be some huge performance problem, you can always change it thanks to the encapsulation of the function.
No matter what the benchmarking results, I am willing to bet profiling will reveal your choice has no effect on the performance of whatever system this is a part of. Go with the simpler and more flexible indexed option, optimize it if it turns into a problem later.