performance penalty of strings as primary keys?

Question

What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.

An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance wise to justify the work.

Duplicate? http://stackoverflow.com/questions/517579/strings-as-primary-keys-in-sql-database — martin clayton, Feb 12 '10 at 10:05

score 4 · Accepted Answer · answered Feb 12 '10 at 10:05

on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.

The DB needs to maintain a B-Tree (or a similar structure) with the key in a way to have them ordered.

If the key is hashed and stored it in the B-Tree that would be fine to check rapidly the uniqueness of the key -- the key can still be looked up efficiently. But you would not be able to search efficient for range of data (e.g. with LIKE) because the B-Tree is no more ordered according to the String value.

So I think most DB really store the String in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be re-balanced if keys are inserted in arbitrary order (no notion of increasing value as with numeric pk).

The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join table, etc.

anthares · Answer 2 · 2010-02-12T09:00:34.397

3

In our product we use varchar(32) for primary keys (GUIDs) and we haven't met performance issues of this. Our product is a web site with extreme overload and is critical to be stable. We use SQL Server 2005.

Edit: In our biggest tables we have more than 3 000 000 records with lots of inserts and selects from them. I think in general, the benefit of migrating to int key will be very low, but the problems while migrating very high.

edited Feb 12 '10 at 09:00

answered Feb 12 '10 at 08:54

anthares

11,070
4
41
61

1

There is a GUID type in SQL Server. Also, it is ideal for replication. – Timmy Feb 12 '10 at 13:14

Mitch Wheat · Answer 3 · 2010-02-12T09:50:31.400

It depends on several factors: RDBMS, number of indexes involving those columns but in general it will be more efficient using ints, folowed by bigints.

Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.

Unless it makes sense in the domain (I'm thinking unique something like social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK reference updated when the referenced object changes.

score 1 · Answer 4 · answered Feb 12 '10 at 08:58

One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).

Primary keys are physically ordered. By using an auto-increment integer you guarantee that each time you insert you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings however, the pk you insert may need to be placed in the middle of the other keys to maintain the pk order. That process of reordering the pks on the insert can get expensive.

performance penalty of strings as primary keys?

4 Answers4

Linked

Related