1

I have a MySQL table where I would like my primary key to be a string. This string may potentially be a bit longer (hundreds of characters).

A very common query would be an INSERT ... ON DUPLICATE KEY UPDATE, which means MySQL would have to check whether the primary key already exists in the table a lot. If this is done with a naive strcmp I imagine this might take quite a while the longer the strings are. Would it thus be better to hash the string manually (either to a shorter string or some other data type) and use that as my primary key or can I just use the long string directly? Does MySQL hash primary key strings internally?

Nils
  • 1,936
  • 3
  • 27
  • 42
  • 2
    You're not going to single-handedly outsmart a code base boasting millions of lines and hundreds of observers. Let MySQL do its thing. – Denis de Bernardy Jun 01 '13 at 10:16
  • 2
    Read here http://stackoverflow.com/questions/517579/strings-as-primary-keys-in-sql-database – Yogus Jun 01 '13 at 10:16

2 Answers2

6

First off, when you have an index on a varchar field mysql doesn't do a strcmp on all entries to find the correct one; instead it uses a binary tree, which is a lot faster than strcmp to navigate through to find the proper entry.

Note: I include some info to improve performance if needs be below, but please do not do that until you hit an actual problem. Varchar indexes are quick, they have been optimized by a lot of very smart people, and in the large majority of cases it will be way more than you need.

With that said, if you have a lot of entries and/or very long keys it can be nice performance wise to use an index of hashes on top of it.

CREATE TABLE users
(
    username varchar not null,
    username_hashed varchar(32) not null,
    primary key (username),
    index (username_hashed)
);

When you insert you can set username_hashed = md5(username) for example. And then you search with something like select otherfields from users where username_hashed = md5(username) and username = username

Note that it seems mysql 5.5 support hash index natively, which would allow you to not have to do that by hand.

Lepidosteus
  • 11,779
  • 4
  • 39
  • 51
  • I'd +1 for the first part of your answer, but your `username_hashed = md5(username)` suggestion is a terrible one. In addition to the potential for duplicate entries, adding the column will merely slow things down due to the overhead of needing to (a) calculate and store it and (b) maintain the index on it. – Denis de Bernardy Jun 01 '13 at 10:32
  • 1
    @Denis: the duplicate is not really an issue as you check on the hash and then the actual value (so even if you have collisions - which I agree you eventually will, mysql then have to match only a couple of entries instead of the whole index). And I found that while it slows down inserts a little (no update on a primary key), it gives a very nice boost in select speed once it reaches millions of long strings. Depends on your needs I guess, thus my bold warning, and I would use a saner design if possible, but if not this can help. – Lepidosteus Jun 01 '13 at 10:38
  • "the duplicate is not really an issue as you check on the hash and then the actual value" -- but then, the primary key's index will always get used and the hashed value won't be used at all -- except to slow the query down by adding a separate check. Or does MySQL actually try to use the index on the hashed value? – Denis de Bernardy Jun 01 '13 at 10:40
  • @Denis: your question makes me re-read on mysql after a year of pg, and that behavior may have changed. At the time, it did use the hash index, but that was an earlier version of mysql 5, it would need testing to confirm on modern version. – Lepidosteus Jun 01 '13 at 10:51
  • Are you sure that hash check is faster than just checking plain text? Text can be skipped by length and all hashes are the same. – Ievgen Nov 18 '20 at 07:47
  • @Lepidosteus so you are saying that you have checked both and back to the day hash check was faster? – Ievgen Nov 18 '20 at 07:49
0

Does the primary key need to be a string? Can't it just be a unique index, with an integer primary auto increment?

Searching will always be faster with integers, and it might take a bit of code rearrangement in your app, but you'll always be better off searching numbered primary keys vs. strings. Look at these two posts that show the difference in memory for int and varchar:

What is the size of column of int(11) in mysql in bytes?

Memory usage of storing strings as varchar in MySQL

Community
  • 1
  • 1
Eric Leroy
  • 1,830
  • 1
  • 18
  • 31