247

Is email address a bad candidate for primary when compared to auto incrementing numbers?

Our web application needs the email address to be unique in the system. So, I thought of using email address as primary key. However my colleague suggests that string comparison will be slower than integer comparison.

Is it a valid reason to not use email as primary key?

We are using PostgreSQL.

giannis christofakis
  • 8,201
  • 4
  • 54
  • 65
robert
  • 8,459
  • 9
  • 45
  • 70
  • 6
    What do you mean by 'primary'? If the email address needs to be unique then it is a key and requires a unique constraint. Whether you decide to 'promote' it be being 'primary' is arbitrary, unless there is a practical reason for doing so e.g. optimizing a poorly performing system. – onedaywhen Sep 27 '10 at 14:53
  • 112
    @robert What if someone wants to change his email address? Are you going to change all the foreign keys too? – systempuntoout Sep 27 '10 at 20:54
  • 2
    a primary key is usualy used to be referenced by child entities. there may also be exports of child data to secondary systems/DB's. As an email tends to change over time this change needs to be cascaded. As a primary key I would use something less likely to change than an email address – MikeD Sep 28 '10 at 11:36

25 Answers25

300

String comparison is slower than int comparison. However, this does not matter if you simply retrieve a user from the database using the e-mail address. It does matter if you have complex queries with multiple joins.

If you store information about users in multiple tables, the foreign keys to the users table will be the e-mail address. That means that you store the e-mail address multiple times.

Sjoerd
  • 74,049
  • 16
  • 131
  • 175
  • 16
    @Sjoerd: The issue is not that the email-address is stored multiple times, although that's definitely inefficient, but who cares about hard drive space today. Most businesses don't have google-scale, where this would matter. The issue is that the email address cannot be changed afterwards, because it's both a primary key & referenced as foreign key. – Stefan Steiger Feb 05 '15 at 17:13
  • In case any one wonders, as I did, a GUID key would be equivalent to an email key I think. – tofutim Feb 12 '16 at 19:07
  • @StefanSteiger thanks for that information, so a uuid saved as a varbinary primary key would be better right ? – MADforFUNandHappy Nov 06 '20 at 13:17
  • @MADforFUNandHappy: Yes, uuid would be better. But you can use whatever you want as primary key (auto-incrementing int/bigint/int128, uuid, varbinary). It's just that if you put data into the primary-key, the data becomes unchangable, because you'll have foreign keys referencing the primary key (aka the data). Once that happens, you cannot change your data anymore (e.g. the email address). – Stefan Steiger Nov 06 '20 at 14:28
193

I will also point out that email is a bad choice to make a unique field, there are people and even small businesses that share an email address. And like phone numbers, emails can get re-used. Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later.

Another problem with emails is that they change frequently. If you are joining to other tables with that as the key, then you will have to update the other tables as well which can be quite a performance hit when an entire client company changes their emails (which I have seen happen.)

HLGEM
  • 94,695
  • 15
  • 113
  • 186
  • 10
    ah, I don't like the saying at all... surrogate keys can also be the source of problems; yes, the application will be more robust to change of business and/or integrity rules, however the information can get lost a bit easier and the identity of records becomes less clear. so I would _not_ recommend a rule of a thumb here... – Unreason Sep 27 '10 at 13:57
  • 4
    I have never in 30 years of database work lost records due to a surrogate key. – HLGEM Sep 27 '10 at 14:31
  • 1
    The OP states very clearly: "Our web application needs the email address to be unique in the system" so your observation about "people and... businesses that share an email address" do not apply in this case. – onedaywhen Sep 27 '10 at 14:44
  • 1
    How could people share an email address? If email addresses are not unique, the whole email system breaks down. Unless you mean that, say, a husband and wife might share a single email account. But in that case, are they really two different customers? Now it depends on what your definition of a "customer" is. I've used email addresses as primary keys and never had a problem with that. Maybe in some cases it would be an issue. Have to look at the context. – Jay Sep 27 '10 at 14:51
  • 12
    @onedaywhen and @jay, just because you think it shoud be unique doen't make it unique. And yes a husband and wife might be different customers. Just becasue you haven't run into this before doesn't mean it won't happen. I have run into it and it does happen which is why email should never be allowed to be considered unique whether you think it should be or not. This is the kind of requirement you push back because it is inherently wrong. – HLGEM Sep 27 '10 at 15:15
  • 17
    @HLGEM: I don't want to get into an endless argument, but you can't say that a proposed key is not unique based on hypotheticals without knowing the context. e.g. from the phone company's point of view, a telephone number uniquely identifies a customer, by definition. Yes, you can say, "But what if there are two or three people who might answer when you call that number?" But this is irrelevant. From the phone company's point of view, by definition this is one customer. (continued ...) – Jay Sep 27 '10 at 16:37
  • 15
    (continued) Likewise, if you are building a system that is largely concerned with email communications -- perhaps a message dispatching system, or a notification forwarding system -- then it is likely that by definition, an email address uniquely identifies a user. If multiple people share that email address, that is irrelevant. They are a single message destination, therefore, they are a single user. "User" and "customer" do not have to be synonyms for "individual human being". – Jay Sep 27 '10 at 16:39
  • 3
    More likely, the e-mail address is going to be used as a login ID. That might inconvenience 1% of the customers but the other 99% will have a much happier experience relative to having to choose and remember a unique user name. The only other alternative is OpenID which most laypeople don't have. – Aaronaught Sep 27 '10 at 18:09
104

the primary key should be unique and constant

email addresses change like the seasons. Useful as a secondary key for lookup, but a poor choice for the primary key.

Steven A. Lowe
  • 60,273
  • 18
  • 132
  • 202
  • 19
    A property of a good key is that is should be stable but NOT necessarily immutable. – onedaywhen Sep 27 '10 at 14:55
  • 20
    if you have a choice, go for constant/immutable keys; less work for you down the road; just because SQL supports cascading updates doesn't mean it's always a good idea! – Steven A. Lowe Sep 27 '10 at 18:07
  • @StevenA.Lowe Could you elaborate why it would NOT be a good idea? Your comment looks very smart (I give you that) but it is missing the full explanation for poor mortals like us. ;) – Antonin GAVREL Nov 23 '22 at 01:53
  • @AntoninGAVREL email addresses are not constants; they can change e.g. when the user changes email providers. primary keys should never change. – Steven A. Lowe Dec 02 '22 at 00:12
  • Thank you Steven, indeed. The answer from @Robert also explains well why it would not be a good idea. – Antonin GAVREL Dec 04 '22 at 03:16
69

Disadvantages of using an email address as a primary key:

  1. Slower when doing joins.

  2. Any other record with a posted foreign key now has a larger value, taking up more disk space. (Given the cost of disk space today, this is probably a trivial issue, except to the extent that the record now takes longer to read. See #1.)

  3. An email address could change, which forces all records using this as a foreign key to be updated. As email address don't change all that often, the performance problem is probably minor. The bigger problem is that you have to make sure to provide for it. If you have to write the code, this is more work and introduces the possibility of bugs. If your database engine supports "on update cascade", it's a minor issue.

Advantages of using email address as a primary key:

  1. You may be able to completely eliminate some joins. If all you need from the "master record" is the email address, then with an abstract integer key you would have to do a join to retrieve it. If the key is the email address, then you already have it and the join is unnecessary. Whether this helps you any depends on how often this situation comes up.

  2. When you are doing ad hoc queries, it's easy for a human being to see what master record is being referenced. This can be a big help when trying to track down data problems.

  3. You almost certainly will need an index on the email address anyway, so making it the primary key eliminates one index, thus improving the performance of inserts as they now have only one index to update instead of two.

In my humble opinion, it's not a slam-dunk either way. I tend to prefer to use natural keys when a practical one is available because they're just easier to work with, and the disadvantages tend to not really matter much in most cases.

Jay
  • 26,876
  • 10
  • 61
  • 112
  • @Conrad: Although, he does point out that it's not a PITA if you have a engine that supports ON UPDATE CASCADE. It's a non-issue at that point code-wise; the only real issue is how extensive is the update and how wide is the key. Email address may be a bit much, but a CASCADE UPDATE for a PK of 2-character country code isn't a big deal. – Matthew Wood Sep 27 '10 at 20:58
  • 5
    @Matthew IMHO its still a PITA. For example assume that when you designed your country table there were only two tables that referenced it, no biggy, But over time it became 20 tables each with hundreds of thousands of records. Some with the reference some without. This makes a single logic write end up being tens of thousands of writes, and it doesn't make it to all the tables because someone forgot a reference when the added the table. This is exact thing happened to me on a 2 char country code table I kid you not. – Conrad Frix Sep 27 '10 at 22:00
  • @Wood & Conrad: The worst case is when there's no built-in DB support. Then you have to write code for it for every table with a posted reference, and this is just a pain and a door for bugs to slip in. With the cascades, you just have to remember to add one clause on each table, not such a big deal. – Jay Sep 28 '10 at 05:10
  • 2
    Advantage 1 and 3 are premature optimizations, advantage 2 is a very minor benefit and is completely overcome by any decent query tool. – Ash Sep 28 '10 at 08:39
  • 4
    @Ash: Thee's a difference between "optimizatin" and "premature optimization". But okay, by the same reasoning, all the disadvantages I've seen anyone mention are premature optimizations. So where does that leave you? As to #2, I find typing in extra joins when trying to do ad hoc queries to be a major pain. Records often have multiple foreign keys so you may need several joins to get to comprehensible data. If by "decent query tool" you mean one that figures out what data you want to see without you telling it and magically does the joins for you, I'd like to see how that works. – Jay Sep 28 '10 at 16:06
15

No one seems to have mentioned a possible problem that email addresses could be considered private. If the email address is the primary key, a profile page URL most likely will look something like ..../Users/my@email.com. What if you don't want to expose the user's email address? You'd have to find some other way of identifying the user, possibly by a unique integer value to make URLs like ..../Users/1. Then you'd end up with a unique integer value after all.

Simen Echholt
  • 11,243
  • 2
  • 34
  • 26
12

It is pretty bad. Assume some e-mail provider goes out of business. Users will then want to change their e-mail. If you have used e-mail as primary key, all foreign keys for users will duplicate that e-mail, making it pretty damn hard to change ...

... and I haven't even started talking about performance considerations.

meriton
  • 68,356
  • 14
  • 108
  • 175
  • How would changing email addresses cause there to be duplicates? Unless user A changes his email address, and then user B changes his email to be the same as user A's old value, and your updates are not done in sequence. Remotely possible, I guess. – Jay Sep 27 '10 at 15:09
  • 2
    A foreign key reference, by definition, contains the value of the primary key of the row it refers to. Put differently, it duplicates the value of the primary key. (So the duplicating is not caused by changing the value. But changing is harder due to this duplication, and the constraint enforcing it). – meriton Sep 27 '10 at 15:20
  • This is not a problem. Foreign-key cascading exists to solve this issue. If a user changes their email, the change will cascade to all the tables using it as foreign key. – Rafa Apr 17 '13 at 12:19
  • 1
    @rafa, I assure you that if you use cascading updates and a whole provider goes out of business or changes their name (Yahoo.com becomes HooYa.com), your database will be locked to all users for hours and maybe days while this cascaded through the system. It is a very valid problem (and a reason why it is a poor idea to use cascading updates if you have any significant amount of data and the key is likely to change.) – HLGEM Jul 31 '13 at 14:14
12

I don't know if that might be an issue in your setup, but depending on your RDBMS the values of a columns might be case sensitive. PostgreSQL docs say: „If you declare a column as UNIQUE or PRIMARY KEY, the implicitly generated index is case-sensitive“. In other words, if you accept user input for a search in a table with email as primary key, and the user provides "John@Doe.com", you won't find “john@doe.com".

xlttj
  • 1,158
  • 2
  • 10
  • 15
  • 7
    Worth mentioning in this connection that John@Doe.com and john@Doe.com may be the same mailbox or may be different mailboxes and you have no way of telling - there's nothing in the spec to say whether the local-part is case-sensitive. – telent Sep 29 '10 at 12:08
9

At the logical level, the email is the natural key. At the physical level, given you are using a relational database, the natural key doesn't fit well as the primary key. The reason is mainly the performance issues mentioned by others.

For that reason, the design can be adapted. The natural key becomes the alternate key (UNIQUE, NOT NULL), and you use a surrogate/artificial/technical key as the primary key, which can be an auto-increment in your case.

systempuntoout asked,

What if someone wants to change his email address? Are you going to change all the foreign keys too?

That's what cascading is for.

Another reason to use a numeric surrogate key as the primary key is related to how the indexing works in your platform. In MySQL's InnoDB, for example, all indexes in a table have the primary key pre-pended to them, so you want the PK to be as small as possible (for speed's and size's sakes). Also related to this, InnoDB is faster when the primary key is stored in sequence, and a string would not help there.

Another thing to take into consideration when using a string as an alternate key, is that using a hash of the actual string that you want might be faster, skipping things like upper and lower cases of some letters. (I actually landed here while looking for a reference to confirm what I just said; still looking...)

Rafa
  • 1,397
  • 15
  • 21
5

yes, it is better if you use an integer instead. you can also set your email column as unique constraint.

like this:

CREATE TABLE myTable(
    id integer primary key,
    email text UNIQUE
);
ibram
  • 4,414
  • 2
  • 22
  • 34
5

Yes, it is a bad primary key because your users will want to update their email addresses.

Bryan Legend
  • 6,790
  • 1
  • 59
  • 60
3

Another reason why integer primary key is better is when you refer to email address in different table. If address itself is a primary key then in another table you have to use it as a key. So you store email addresses multiple time.

klew
  • 14,837
  • 7
  • 47
  • 59
3

I am not too familiar with postgres. Primary Keys is a big topic. I've seen some excellent questions and answers on this site (stackoverflow.com).

I think you may have better performance by having a numeric primary key and use a UNIQUE INDEX on the email column. Emails tend to vary in length and may not be proper for primary key index.

some reading here and here.

Community
  • 1
  • 1
Saif Khan
  • 18,402
  • 29
  • 102
  • 147
3

Personally, I do not use any information for primary key when designing database, because it is very likely that I might need to alter any information later. The sole reason that I provide primary key is, it is convenience to do most SQL operation from client-side, and my choice for that has been always auto-increment integer type.

tia
  • 9,518
  • 1
  • 30
  • 44
3

You may need to consider any applicable data regulation legislation. Email is personal information, and if your users are EU citizens for instance then under GDPR they can instruct you to delete their information from your records (remember this applies regardless of which country you are based).

If you need to keep the record itself in the database for referential integrity or historical reasons such as audit, using a surrogate key would allow you to just NULL all the personal data field. This obviously isn't as easy if their personal data is the primary key

3

I know this is a bit of a late entry but i would like to add that people abandon email accounts and service providers recover the address allowing another person to use it.

As @HLGEM pointed out "Jsmith@somecompany.com can easily belong to John Smith one year and Julia Smith two years later." in this case should John Smith want your service you either have to refuse to use his email address or delete all your records pertaining to Julia Smith.

If you have to delete records and they relate to the financial history of the business depending on local law you could find yourself in hot water.

So i would never use data like email addresses, number plates, etc. as a primary keys because no matter how unique they seem they are out of your control and can provide some interesting challenges that you may not have time to deal with.

Robert
  • 3,328
  • 2
  • 24
  • 25
2

Your colleague is right: Use an autoincrementing integer for your primary key.

You can implement the email-uniqueness either at the application level, or you coudl mark your email address column as unique, and add an index on that column.

Adding the field as unique will cost you string comparision only when inserting into that table, and not when performing joins and foreign key constraint checks.

Of course, you must note that adding any constraints to your application at the database level can cause your app to become inflexible. Always give due consideration before you make any field "unique" or "not null" just because your application needs it to be unique or non-empty.

jrharshath
  • 25,975
  • 33
  • 97
  • 127
  • 2
    "Always give due consideration before you implement requirement x just because your application needs requirement x." -- the worst piece of advice I've read in quite some time. – onedaywhen Sep 27 '10 at 14:47
  • I'm not convinced by your "argument" -- in real life there will often be situations when some essential data (e.g, a phone number) will not be available immediately. If such a field is marked as NOT NULL in a database, it will require the users to pollute the data with dummy fields (like 123) instead of leaving it empty. It would be more practical to let the application handle the constraints (and in this case, the app could flag an empty field as a action item). – jrharshath Sep 27 '10 at 14:59
  • 5
    I agree that defining a field "not null" should be done cautiously. Requirements like "we always need the customer's phone number" should be considered carefully. Might it not be desirable at times to create a customer record even though we don't know the phone number right now, and go back and get it later? But "this field must be unique" is a different category. I can't imagine saying "It's okay for two employees to have the same social security number, we'll figure it out later." How would you ever straighten out the data? – Jay Sep 27 '10 at 15:13
2

Use a GUID as a primary key... that way you can generate it from your program when you do an INSERT and you don't need to get a response from the server to find out what the primary key is. It will also be unique accross tables and databases and you don't have to worry about what happens if you truncate the table some day and the auto-increment gets reset to 1.

JoelFan
  • 37,465
  • 35
  • 132
  • 205
1

primary key should be chosen a static attribute. Since email addresses are not static and can be shared by multiple candidates so it is not a good idea to use them as primary key. Moreover email addresses are strings usually of a certain length which may be greater than unique id we would like to use[len(email_address)>len(unique_id)] so it would require more space and even worst they are stored multiple times as foreign key. And consequently it will lead to degrade the performance.

user2719152
  • 939
  • 3
  • 11
  • 20
1

you can boost the performance by using integer primary key.

Second Person Shooter
  • 14,188
  • 21
  • 90
  • 165
1

you should use an integer primary key. if you need the email-column to be unique, why don't you simply set an unique-index on that column?

oezi
  • 51,017
  • 10
  • 98
  • 115
1

If you have a non int value as primary key then insertions and retrievals will be very slow on large data.

Amareswar
  • 2,048
  • 1
  • 20
  • 36
  • 1
    No, inserts it will be *slower*, because you need **two** unique indexes: one on the generated primary key and another one on the email address. –  May 04 '13 at 13:01
0

It depends on the table. If the rows in your table represent email addresses, then email is the best ID. If not, then email is not a good ID.

Lajos Arpad
  • 64,414
  • 37
  • 100
  • 175
0

If it's simply a matter of requiring the email to be unique then you can just create a unique index with that column.

Micah
  • 111,873
  • 86
  • 233
  • 325
0

Email is a good unique index candidate, but not for primary key, if it is a primary key, you will be no able to change the contact's emails address for example. I think your join querys will be slower too.

Chocolim
  • 54
  • 1
  • 2
-2

don not use email address as primary key , keep email as unique but don not use it as primary key, use user id or username as primary key

Nikki
  • 409
  • 1
  • 5
  • 15
  • Why? You described some conclusion or your position on the topic, but this doesn't give reader any understanding of the issue. – kravemir Oct 23 '22 at 08:34