6

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.

Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)

So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)

So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.

Thanks in advance.

biogeo
  • 63
  • 3
  • With correct indices, the "performance" is largely "irrelevant" - the overheads are constant. As such, consider the question from a non-performance viewpoint. – user2864740 Feb 18 '14 at 22:08
  • 1
    possible duplicate of [Surrogate vs. natural/business keys](http://stackoverflow.com/questions/63090/surrogate-vs-natural-business-keys) – Mitch Feb 18 '14 at 22:08
  • 1) Your foreign key doesn't need to be on your primary key column. 2) If you have a one-to-many relationship (one parent with many children), your child's foreign key column _definitely_ shouldn't be auto increment. 3) the foreign key constraint can have something like RESTRICT or CASCADE to avoid accidental errors, depending on what exactly you'll be doing. – Brian S Feb 18 '14 at 22:09
  • @Mitch, you're right, this does seem to be a duplicate! My searching failed me. Thanks for pointing me in the right direction. It's also somewhat gratifying to see that this is an open debate among more experienced database folks. – biogeo Feb 18 '14 at 23:11
  • @biogeo, it is certainly a debate... it's the DBA version of Windows VS Mac, or emacs vs VI. – Mitch Feb 19 '14 at 02:12

4 Answers4

8

I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.

Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).

There are better reasons to use an auto-incremented primary key.

  1. You know which values were most recently inserted.
  2. If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
  3. If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
  4. You don't have to worry about leading spaces, trailing spaces, and other character oddities.

You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, such efficiencies are premature and probably not worth the effort.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
2

Gordon is right (which is no surprise).

Here are the considerations for you not to worry about, in my view.

When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.

INTs and VARCHARS can both be indexed quite efficiently. You won't see much difference in time performance.

Here's what you should worry about.

There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says

 WHERE colm IS NULL   /* slow! */

or

 WHERE colm IS NOT NULL  /* slow! */

This kind of thing defeats indexing. In a similar vein, your performance will suffer bigtime if you apply functions to columns in search. For example, don't do this, because it too defeats indexing.

 WHERE SUBSTR(colm,1,3) = 'abc'  /* slow! */

One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.

 parent       varchar(20)   pk  fk to parent table
 birthorder   int           pk
 name         varchar(20)

Then, you could have rows like...

  parent      birthorder     name
  homer       1              bart
  homer       2              lisa
  homer       3              maggie

But, if you tried to insert a fourth row here like this

  homer       1              badbart

you'd get a primary key collision because (homer,1) is occupied. It's probably a good idea to work how you'll manage primary keys for your subsidiary tables.

Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
0

The main benefit you get from numeric values that that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.

Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data was in a nice, searchable format -- like a binary search tree, if could be found in O(lon n), much faster.

Indexing is the process MySQL uses to prepare data to be searched, it generates search trees and other clever do-bobs that will make finding data quick. It makes many searches much faster. However, to do it, it has to compare the value you are searching for to various 'key' values to determine if your value is greater than or less than the key.

This comparison can be done on non-numeric values. However, comparing non-numeric values is much slower. If you want to be able to quickly look up data, your best bet is you have a integer 'key' that you use.

Damien Black
  • 5,579
  • 18
  • 24
0

Numeric row id's have many advantages over a string based id. Most of them are mentioned in other answers. 1. One of them is indexing. Primary keys are by default indexed in a relational database. So, having a numeric key is always more efficient. 2. Numeric fields are stored much more efficiently 2. Joins are much faster with numeric keys. 3. A row id could be a foreign key. Numeric id's are compact to store, making them efficient 4. I think using a auto-increment on primary key has its advantages too

-Thanks _san

tempusfugit
  • 437
  • 1
  • 3
  • 14