220

I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?

For Example I have a database that would have about 100 million row like mobile number, name and email. mobile number and email would be unique. so can I have the mobile number or email as a primary key,

well it effect my query performance when I search based on email or mobile number. similarly the primary key well be used as foreign key in 5 to 6 tables or even more.

I am using MySQL database

Engr.Aftab Ufaq
  • 3,356
  • 3
  • 21
  • 47
mainstringargs
  • 13,563
  • 35
  • 109
  • 174

15 Answers15

242

Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.

Jamie Lester
  • 848
  • 9
  • 20
kemiller2002
  • 113,795
  • 27
  • 197
  • 251
  • 15
    wouldn't it depend on the database? I would think a properly indexed string would not be that much slower if at all from a number? – Ryan Guill Feb 05 '09 at 20:09
  • 2
    I would agree there are a lot of variables to consider. (In sqlserver) we have seen real performance issues with using strings with lengths in the mid to high teens and above even when indexed. Buy you're right there are things to overcome this hardware for instance. – kemiller2002 Feb 05 '09 at 20:12
  • 1
    Fair enough. I would agree though that if a string makes sense, that is what you should use. I would also say there are definitely times for GUID or UUID fields in databases where an autoincrement field would not work. – Ryan Guill Feb 05 '09 at 20:16
  • 7
    Also keep in mind that there's often a very big difference between a CHAR and a VARCHAR when doing index comparisons – Tom H Feb 05 '09 at 20:42
  • 7
    The number of comments of this answer makes it clear how incomplete it is. Mentioning indexing would have been the minimum acceptable answer. – Pedro Rolo Mar 29 '11 at 17:56
89

Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.

Tohid
  • 6,175
  • 7
  • 51
  • 80
Jeff Martin
  • 10,812
  • 7
  • 48
  • 74
  • 2
    This can cause "hot spots" for new inserts though. As long as you're managing your database properly, you should have extra space on your pages for inserts anyway and page splits should be rare. – Tom H Feb 05 '09 at 19:50
  • 22
    that is when primary keys are clustered. you can create them unclustered too. – Learning Feb 05 '09 at 19:59
  • XIDs are ordered which might help if you just use xid strings – Sinaesthetic Apr 07 '20 at 21:50
  • Can someone explain to me why "the index has to be resequenced". As far as I know db indexing data by using balanced binary tree. So it doesn't care if new row is in the middle or not. – Che Huu Apr 12 '22 at 11:12
  • I suppose it depends on the database technology you are using at the time. This question is over 13 years old, so at the time, string based indexes caused issues in SQL server. – Jeff Martin May 12 '22 at 23:23
28

Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DOES NOT cause the index to be rewritten. It does not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise a tables data that has the clustered index. At most, you will end up writing 2 pages of database.

Mark Thompson
  • 316
  • 3
  • 2
  • Good explanation. But holds this true for all SQL databases? I've heard of MySQL performance issues when using random UUID as primary key. – hgoebl Aug 27 '17 at 11:08
19

Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.

I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don;t get duplicate records, and you don't have to update a million child records because a company name changed.

HLGEM
  • 94,695
  • 15
  • 113
  • 186
  • 35
    Strings that are good candidates for PKs do not have duplicates - otherwise they wouldn't be a good candidate for a PK. Think of ICD-9 codes, country codes, VIN #s. Using a name as an example of a problem with natural keys is misguided, because they should never be a candidate in the first place. – Tom H Feb 05 '09 at 20:35
  • 6
    @Tom H: ISO County codes DO change. [ http://en.wikipedia.org/wiki/ISO_3166-1#Editions_and_changes ] As an answer on a related question said [ http://stackoverflow.com/questions/925266/database-design-and-the-use-of-non-numeric-primary-keys/925290#925290 ] "For PRIMARY KEY's make sure their uniqueness is under your control" – Steve Schnepp May 29 '09 at 11:04
  • 4
    @SteveSchnepp: yes and the ISO is the trusted body for managing that change. On the other hand, when you need to merge your monotonic sequence of incrementing integer values with someone else's, you are on your own ;) – onedaywhen Mar 21 '12 at 08:56
  • 1
    I would agree that names shoud not be considered as a key , I have just seen to omany times when they were. – HLGEM Mar 21 '12 at 13:44
  • 1
    @onedaywhen merging 2 monotonic sequence of incrementing integer is quite easily done via prefixing or suffixing :) – Steve Schnepp Mar 21 '12 at 16:25
  • Good answer! You may think to use strings in the ID to add a prefix, but like you said, what if the prefix changes. Instead store the prefix in a separate field and concatenate it onto the id when displaying. – CrazyTim Dec 13 '12 at 04:51
11

It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.

If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.

Al Katawazi
  • 7,192
  • 6
  • 26
  • 39
10

Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...

Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.

Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
  • that makes no sense cloutierm, what do you mean? – HLGEM Feb 05 '09 at 19:48
  • @HLGEM: If I understand him write, he means like syncing records created on a laptop with the main db. – Joel Coehoorn Feb 05 '09 at 19:49
  • I mean I have two separate databases with the same entities, only one is updated less frequently for persistent storage purposes. If I Query for entity "California" on Database A, I want it to be fundamentally the same "California" on Database B. – mainstringargs Feb 05 '09 at 19:52
  • 1
    And it is 'like' syncing records created in a laptop in that it's the same problem: records created in one place should not conflict with records created in another. One possible solution here is Guid keys. – Joel Coehoorn Feb 05 '09 at 19:59
6

Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.

In this case, it's almost always better to go with a string as a natural primary key, provide you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.

Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?

Are the programmers and interactive users who specify queries able to use the natural key to get what they want?

If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whther to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58
4

Two reasons to use integers for PK columns:

  1. We can set identity for integer field which incremented automatically.

  2. When we create PKs, the db creates an index (Cluster or Non Cluster) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.

drs
  • 5,679
  • 4
  • 42
  • 67
4

Indices imply lots of comparisons.

Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.

Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
3

Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.

One thing to watch out for, though, is if you have clustered indices on a any key and are doing large numbers of inserts that are non-sequential in the index. Every line written will cause the index to re-write. if you're doing batch inserts, this can really slow the process down.

Yes - that Jake.
  • 16,725
  • 14
  • 70
  • 96
3

What is your reason for having a string as a primary key?

I would just set the primary key to an auto incrementing integer field, and put an index on the string field.

That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.

You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.

John B
  • 20,062
  • 35
  • 120
  • 170
  • 3
    I think putting any intelligence in a key is asking for trouble. Will they stay unique? Did they start all the account numbers with the state's abbreviation at the begining only to the client move. Update a field - no problem - all those tables linked by account number - what a mess. – JeffO Feb 09 '09 at 02:46
  • 1
    An example of using a string as a PK could be a table of settings. e.g. settingNamePK, isUserEditable, isCustomerEditable etc Then if you wanted to modify the setting behaviour "UPDATE setting SET ... WHERE settingNamePK='dailyWorkObligation'" is much nicer than having to use ID's and store somewhere the mapping of the ID's. Of course you could have an integer PK and have the setting name as another unique key as well. – MeatPopsicle Apr 11 '13 at 13:42
  • With the primary key being an auto-incremented integer, shouldn't inserts also be unaffected in their speed? – Dennis Oct 02 '14 at 17:22
  • For curious Rails devs, here's [how to specify an index length](http://apidock.com/rails/ActiveRecord/ConnectionAdapters/SchemaStatements/add_index). Note that SQLite doesn’t support index length. – Dennis Oct 02 '14 at 17:26
3

From performance standpoint - Yes string(PK) will slow down the performance when compared to performance achieved using an integer(PK), where PK ---> Primary Key.

From requirement standpoint - Although this is not a part of your question still I would like to mention. When we are handling huge data across different tables we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables and mostly each or some table would be related to the other through some relation ( a concept of Foreign Key ). Therefore we really cannot always choose an integer as a Primary Key, rather we go for a combination of 3, 4 or 5 attributes as the primary key for that tables. And those keys can be used as a foreign key when we would relate the records with some other table. This makes it useful to relate the records across different tables when required.

Therefore for Optimal Usage - We always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.

3

I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.

create table sample (
  sample_pk             INT NOT NULL AUTO_INCREMENT,
  sample_id             VARCHAR(100) NOT NULL,
  ...
  PRIMARY KEY(sample_pk)
);

You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).

Jerry Chen
  • 321
  • 3
  • 4
2

By default ASPNetUserIds are 128 char strings and performance is just fine.

If the key HAS to be unique in the table it should be the Key. Here's why;

primary string key = Correct DB relationships, 1 string key(The primary), and 1 string Index(The Primary).

The other option is a typical int Key, but if the string HAS to be unique you'll still probably need to add an index because of non-stop queries to validate or check that its unique.

So using an int identity key = Incorrect DB Relationships, 1 int key(Primary), 1 int index(Primary), Probably a unique string Index, and manually having to validate the same string doesn't exist(something like a sql check maybe).

To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.

haxxxton
  • 6,422
  • 3
  • 27
  • 57
JPoole
  • 21
  • 1
2

There could be a very big misunderstanding related to string in the database are. Almost everyone has thought that database representation of numbers are more compact than for strings. They think that in db-s numbers are represented as in the memory. BUT it is not true. In most cases number representation is more close to A string like representation as to other.

The speed of using number or string is more dependent on the indexing then the type itself.

takacsot
  • 1,727
  • 2
  • 19
  • 30