6

Question

Is it a good rule of thumb for database IDs to be "meaningless?" Conversely, are there significant benefits from having IDs structured in a way where they can be recognized at a glance? What are the pros and cons?

Background

I just had a debate with my coworkers about the consistency of the IDs in our database. We have a data-driven application that leverages spring so that we rarely ever have to change code. That means, if there's a problem, a data change is usually the solution.

My argument was that by making IDs consistent and readable, we save ourselves significant time and headaches, long term. Once the IDs are set, they don't have to change often and if done right, future changes won't be difficult. My coworkers position was that IDs should never matter. Encoding information into the ID violates DB design policies and keeping them orderly requires extra work that, "we don't have time for." I can't find anything online to support either position. So I'm turning to all the gurus here at SA!

Example

Imagine this simplified list of database records representing food in a grocery store, the first set represents data that has meaning encoded in the IDs, while the second does not:


ID's with meaning:

Type
1 Fruit
2 Veggie

Product
101 Apple
102 Banana
103 Orange
201 Lettuce
202 Onion
203 Carrot

Location
41 Aisle four top shelf
42 Aisle four bottom shelf
51 Aisle five top shelf
52 Aisle five bottom shelf

ProductLocation
10141 Apple on aisle four top shelf
10241 Banana on aisle four top shelf
//just by reading the ids, it's easy to recongnize that these are both Fruit on Aisle 4

ID's without meaning:

Type
1 Fruit
2 Veggie

Product
1 Apple
2 Banana
3 Orange
4 Lettuce
5 Onion
6 Carrot

Location
1 Aisle four top shelf
2 Aisle four bottom shelf
3 Aisle five top shelf
4 Aisle five bottom shelf

ProductLocation
1 Apple on aisle four top shelf
2 Banana on aisle four top shelf
//given the IDs, it's harder to see that these are both fruit on aisle 4

Summary

What are the pros and cons of keeping IDs readable and consistent? Which approach do you generally prefer and why? Is there an accepted industry best-practice?

-------- edit ( helpful background info from comments, below ): --------

In our tables, the Primary Key is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, certain developers also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) vehemently disagreed for theoretical reasons. In practice, not one of those objections are holding true. Moreover, all developers using the data agree that it's now significantly easier to maintain.

I'm looking for (but haven't seen) a defensible argument against using immediately recognizable IDs in a data-centric environment.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
gMale
  • 17,147
  • 17
  • 91
  • 116
  • 4
    It seems silly to encode the relationship information into the ID, given that relational db's inherently maintain this for you. Plus, database id's aren't necessarily suppose to be human readable. If you need relationship data that is easily human parsable, then either build queries/views to show the data, or consider a different mechanism for storing your data. The only real reason to consider human parsable information is for debugging (imo) – Alan Feb 09 '11 at 22:35
  • @Alan: I've never done a Community Wiki before. If I wanted to, how would I convert this question? Is conversion even possible? – gMale Feb 10 '11 at 02:55
  • @Alan: You've gotten exactly to the heart of the issue: If you will spend a MASSIVE amount of time debugging data and it requires little effort to make the IDs "consistent" and "readable," then is it truly "evil" to do so? – gMale Feb 21 '11 at 14:13
  • @gmale I don't think that silly question can draw 11 answers, one of them highly upvoted. There must be *something* good in the question. – Shadow The GPT Wizard Mar 05 '11 at 16:02
  • 1
    Please don't vandalize your question. – balpha Mar 05 '11 at 16:49

11 Answers11

22

Con: I just changed "Aisle Five top shelf" to "Aisle Six top shelf" so now I have to change its ID to 61 and now I have to change the ProductLocation ID for "Grapes on Aisle five top shelf" to be 10461 and oh god where else does the shelf location ID string show up in IDs in my database oh god whoever designed IDs to carry meaning ought to be shot it's four in the morning and everything's gone crazy and why does "Aisle seven bottom shelf" have an ID of 41 die die die.

CanSpice
  • 34,814
  • 10
  • 72
  • 86
  • the same thing happens with variable/method names all the time. I could name all my methods: method2231C(). But I don't. Why? Readability and maintenance. Yes, when my code is refactored, that means names have to change. A lot of time is spent doing exactly that. And, to me, it's worth it (in the long-term). – gMale Feb 09 '11 at 22:43
  • 8
    @gmale: There's a difference between variable/function names (which are designed to be read by humans, and impart semantic information), and row IDs (which aren't designed to be read by humans, nor impart semantic information). – Oliver Charlesworth Feb 09 '11 at 22:48
  • @gmale: I would like to see your response to Marc B's answer here. Also, variable names in code are analogous to table/column names. They describe **what** the data is, not what the value of the data is, and that's what you're proposing the ID is for: for describing the value of the data in that row. Well, the value of the data of a row already describes the value of the data in that row, why do you need a second piece of data to do this? And a piece of data that can so easily get out of sync with the actual data? – CanSpice Feb 09 '11 at 22:50
  • @Oli: I would agree, except our application is HEAVILY dependent on data. Nearly 100% of our bugs stem from data errors. Those errors are exceptionally difficult to spot in a tables that contains only ids (like ProductLocation, above). So, to the contrary, our data (and I'd argue most data) is constantly read by humans and maintained for years into the future. – gMale Feb 09 '11 at 22:52
  • @gmale: And having to change data in two places in one row (not to mention all of the other places in the DB where a given ID is used) is supposed to reduce the number of data-driven errors you get? – CanSpice Feb 09 '11 at 22:55
  • @gmale: If you are having integrity problems, then it's only going to be made worse by adding redundant information in the decimal representation of an ID that *must* be kept in sync with the actual contents of the row. How is "10141, 101, 41" more helpful than "1, 1, 1" to a human trying to debug the system? – Oliver Charlesworth Feb 09 '11 at 22:56
  • @CanSpice: reguarding MarkB's response, the reality of our project is most datasets are small enough where the ids can be sufficiently padded, where necessary. I think in some cases you have to factor in the problem domain. And in our application, the data is everything so, in several contexts it is very helpful when that data is instantly readable without having to cross reference 3 or 4 tables or views. Especially, when assigning and maintaining the IDs is quite painless. – gMale Feb 09 '11 at 22:57
  • @Oli: 10141 is exceptionally helpful to a human trying to debug the system because when you see it, you know EXACTLY what you're dealing with without having to cross-reference several other tables. And if it's JUST AS EASY to assign 10141 as it is to assign 111, then why not go the route that will save you headaches in countless situations? People are placing "high level theory" above "simple practicality" for no other reason than, it's how it's "always" done. To me, there are times to question the norm... – gMale Feb 09 '11 at 23:03
  • 3
    @gmale: your explanation doesn't make sense. If you're trying to fix data issues, then write queries that expose this information. You're only going to exacerbate the problem by now encoding information in the id. The database will help you maintain integrity, if you let it. If you devise your own metaschema, you might allow visual inspection of data, but you lose machine ability to verify. – Alan Feb 09 '11 at 23:06
  • @gmale: How does "101" confer more semantic information ("apple") than "1"? It's not "just as easy" (in fact I'm not suggesting "assigning" 111, I'm suggesting letting the database assign via auto-increment); it's a great deal more error-prone if your data ever changes, it's up to you to carefully update all the IDs, otherwise you're screwed! – Oliver Charlesworth Feb 09 '11 at 23:06
  • @CanSpice: regarding changing the data in multiple places... think like refactoring in eclipse (or your favorite IDE). There are ways to do it where you change it in one place and it propagates everywhere. It's very easy to set the data up to be "refactor-freindly." Trivially easy because the data is all related and normalized! So if and ID needs to change, change it in 2 seconds and since everything is consistent, it adjusts everywhere. – gMale Feb 09 '11 at 23:07
  • 4
    @gmale: You haven't addressed the "41 doesn't mean '4th shelf bottom' any more it means '5th shelf top'" problem, and thus the "10141 doesn't mean what you think it means" problem, and thus you'll have to go to the Shelf table to find out what 41 means anyhow. Really, it sounds like you had this idea to have semantic IDs in your database and everybody you talk to is trying to tell you that it's not a very good idea, yet you are firmly convinced that you're right. Well, sometimes the norm is the norm for a reason. – CanSpice Feb 09 '11 at 23:07
  • @Alan: I see what you're saying but your assertions are incorrect. This whole conversation is making it more clear to me that the problem domain always has to be considered. We do our data entry in spreadsheets that get pulled into the database on the fly (during deploy). Spreadsheets make it incredibly easy for a "machine to verify" your data. To me, that small fact refutes 99% of the opposing arguments I'm hearing. – gMale Feb 09 '11 at 23:10
  • 1
    @gmale: Everything is not magically "consistent". It's left to you (the programmer) to enforce that *every single reference* to a changing ID is updated, atomically. The database engine won't help you *at all* with that, especially not with IDs that have other IDs buried inside them. – Oliver Charlesworth Feb 09 '11 at 23:12
  • @CanSpice: I'm open to what you're saying but all the arguments center around the difficulty of changing things. First of all, the data isn't changing in the fundamental ways that would make this problematic. Secondly, as a programmer, making things change automatically is very easy. We all like the principles of DRY and when another programmer tells me that keeping IDs consistent in a normalized database is *HARD*, I can't believe my ears. – gMale Feb 09 '11 at 23:15
  • @gmale: It is hard, because there's nothing to help you maintain integrity. And if you like the priniciples of DRY, why are you RY? – Oliver Charlesworth Feb 09 '11 at 23:18
  • @Oli: I'm aware that things aren't magically "consistent." I'm also aware of how easy it is to make computers enforce simplistic rules. Especially when "the database engine" isn't my only tool. Namely, our existing spreadsheet makes the process trivial. And last but not least *the ids don't change, once they're set*. Imagine a dataset where all IDs were permanent. Would your arguments still hold? – gMale Feb 09 '11 at 23:18
  • @gmale: No, my arguments about consistency wouldn't hold (but I would still question how you can hold your belief that "it's easy to make computers enforce rules" whilst simultaneously stating that "nearly 100% of our bugs are from data errors"). But my arguments about DRY *would* still hold. And it'd also bring back @CanSpice's original argument about what happens when the underlying "data" changes (Aisle 4 becomes Aisle 7, or whatever), and it no longer matches your IDs. – Oliver Charlesworth Feb 09 '11 at 23:21
  • @gmale: so if I understand correctly, spreadsheets will validate your data prior to loading, Thus you should not have data issues (or at the very least, very slight ones). Then what other reason would you have to manually inspect database IDs? FWIW, Spreadsheets are very good at dealing with *tabular* data. – Alan Feb 09 '11 at 23:22
  • @gmale: In a world where the ID carries no meaning, if I change my shelf naming convention from 1,2,3... to m,n,o..., all I have to change is the data fields in the Shelf table. In a world where the ID carries meaning, not only do I have to change the data fields in the Shelf table, I have to change the ID fields too, and then I have to change two fields in the ProductLocation table for each row in that table, and then I have to remember all of the other places that the location is used. Tell me, which way is easier? – CanSpice Feb 09 '11 at 23:23
  • @gmale: if you have no data issues, and the data is set, then the argument becomes--you gain no benefit from it, and requires *extra* work. Unless I'm missing the reason why you want human parsable id's. – Alan Feb 09 '11 at 23:24
  • @Oli: Again, we're back to the issue of change. I agree with you two that if things change in fundamental ways it can get *slightly* difficult, sort of. However, it's very easy to use a spreadsheet (and underlying formulas) to isolate the pieces that relate to one another. Meaning, you set the core id's once and don't repeat yourself. Every other id, that builds on the "core" IDs is created via formula. This takes all of about 30 minutes to set up. And in problem domain like ours, it removes the issue of change completely off the table. From there, to me, the benefits are accentuated. – gMale Feb 09 '11 at 23:25
  • 2
    @gmale: If the values *can* change, then we're back to the question "why is 101 encoded into the ID more useful to a human than the value 101 (or simply 1, for that matter) stored in another column in the row, given that it could be 102 next time?" A human debugger will always have to cross-reference. – Oliver Charlesworth Feb 09 '11 at 23:27
  • @Alan: I should clarify, the IDs are set. Certain other values do change but never the IDs. Also the reason for human parsable IDs is because humans are looking at, working with, maintaining and making sense of those IDs at several different steps in the process. Not the least of which is testing, debugging and displaying things on the front end. I agree, it's bad practice to make assumptions *in production code* as to the value of IDs, however it is INCREDIBLY handy to be able to understand an ID on sight, without any additional effort. Especially when tweaking things in testing/debugging – gMale Feb 09 '11 at 23:28
  • @gmale: Another argument against: This is very definitely *not* normalised. – Oliver Charlesworth Feb 09 '11 at 23:30
  • @gmale: If the IDs are set while their values can change, then how does "the ID for shelf 4 bottom is 41" help you when the value for ID 41 can change to "shelf 5 top"? How does that help you understand what the ID is? – CanSpice Feb 09 '11 at 23:30
  • @Oli: I think I understand your question and it's a good one to contemplate--if it's so important, why not store the information you're trying to put in the ID in another field. Because those other fields would need to be present everywhere the ID is used, or else cross-reference is necessary. I know referencing other tables will be needed at times, but why make things more complicated (in those instances where it's not absolutely NEEDED) when there is very little pain involved in making things "readable?" – gMale Feb 09 '11 at 23:34
  • @all: I'm enjoying this debate--you guys have lots of EXCELLENT points--but I have to head out of the office. I'll pick up when I get home... – gMale Feb 09 '11 at 23:38
  • @CanSpice: Good question. "shelf 5 top" is the piece "tied" to the ID. To explain what I mean when I say some values change, following my example, that row in the table would have *other* location information that changes. Like the timestamp. 4 is always used for "aisle 4" and 1 is always used for "top." Together that creates "41." That is something that does not change (assuming the underlying IDs, 4 & 1 stay set--and they do in our actual problem domain). – gMale Feb 09 '11 at 23:45
  • The debate is strange as the problem was solved in 1960s, when RDBMS were created. The ids SHOULD be meaningful. And all fields MUST BE ATOMIC. So for a table of locations, where MEANING is not atomic but complex, you need an ID that comprises two columns. The location table should have two columns (eg. number_of_isle int, type_of_shelf varchar(10)) and a constraint like: primary key (number_of_isle, type_of_shelf). This is called 'normalization' - even if you plan not to adhere to it, it's nice to know. – fdreger Feb 11 '11 at 17:12
6

There are several problems with using database IDs to encode information about a row. If you want your carrots to have an "ID" of 203, you should add a product_id column (for example) and put this information there instead. Why?

  1. By customizing your IDs, you have to add domain-specific code that manages your IDs and you can't rely on database features like auto-incrementing or UUIDs.
  2. If you ever have to change your classification, it will mess up your table relations, browser bookmarks, search engine results, etc.
  3. It's not common practice -- so when you put application- or domain-specific data into the ID field, it will be assumed by many that this is meaningless information, when it isn't. You will need a data dictionary (and you will have to ensure that people read the data dictionary) to note the fact that this is valuable information.

The only required purpose of an ID is to uniquely identify a row within a table. If it can provide good lookup performance, that's a bonus, and if it can be compactly stored, that's another bonus. But it shouldn't contain any information about the entity in the row it identifies, other than the unique identifier of that entity.

alexantd
  • 3,543
  • 3
  • 27
  • 41
  • @alexantd: great answer. I like it enough to accept it but I have a question. Assuming that creating a product_id column is a more acceptable approach; in this case, that column would also be unique. Given this property, couldn't it serve as the primary key? Yes. At that point, your primary key is a unique field that is also useful in lots of contexts. What's wrong with having the ID be instantly recognizable? One glance and you know what it represents? Especially if the cost of maintaining it is minimal (as measured in the realities of the problem domain, as opposed to theory)? – gMale Feb 10 '11 at 02:16
  • @gmale: 1) It's important to distinguish between primary and unique keys. A primary key is always unique, but not necessarily vice versa. Your products have other unique properties: name, SKU, etc. Why not make one of these the primary key instead? 2) The primary key may be instantly recognizable to *you* - but the risk is that nobody else will know what it represents. Is it the product ID? number of sales in a 2-week period? Calorie count? Etc. 3) The maintenance is even lower if you do it the right way. :) – alexantd Feb 10 '11 at 17:08
  • @alexantd: "It's important to distinguish between primary and unique keys". Why? A primary key just means any of the keys of a table. A primary key is not fundamentally different to any other key. So for practical reasons it's often *useful* to designate one key as a "primary" one but it isn't particularly important to do that, i.e. it doesn't make any difference to the form or function of the database. – nvogel Feb 11 '11 at 14:11
  • @dportas: 1) A primary key can't be null. A unique key (by itself) can. 2) It's true that you could design a database with no primary keys, but the OP isn't doing that. (I take "IDs" to mean "primary keys"... he/she can correct me if that is mistaken - which would of course totally invalidate all of these answers :) – alexantd Feb 11 '11 at 23:51
  • No key can be null. That's a requirement of ALL keys, not just the one you designate as a "primary" one. If you have a nullable column then that obviously isn't a key. – nvogel Feb 12 '11 at 07:34
  • @dportas: A value in a unique key column, in the absence of a separate NOT NULL constraint, can be null. I just tried it myself in MySQL. Maybe MySQL is violating some kind of fundamental relational edict (it wouldn't be the first time). – alexantd Feb 15 '11 at 18:50
  • @alexantd: regarding your first response (2/10 17:08): 1) exactly my same question, what's the con of making a non-null, unique field a PK? 2) why is it a "risk" if no one else knows what it represents? No one else knows what a auto-increment id represents, either. Those people who don't know what it represents will be unaffected. That is, most people don't care what the ID is. For the few who do (debuggers for example) "recognizable" IDs make working with the data easier 3) we've done it "the right way" and maintenance is HIGHER for the most critical people in the system—those fixing problems – gMale Feb 21 '11 at 14:43
  • @alexantd: regarding your other response (2/11 23:51): you are correct, when I say "ID" I'm referring to the "primary key." In our tables, the PK is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, we also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) disagreed for theoretical reasons. In practice, over here in reality, none of those objections are holding true. – gMale Feb 21 '11 at 14:52
  • @gmale: 1) I fear we are going around in circles here. 2) It is universally understood that an auto-increment ID represents *nothing* other than its row. When your IDs instead represent *something*, you must make it known to whomever it may concern what that *something* is. 4, 15, 16, 18, 32 - I know what those numbers represent, do you? 3) You'd have to explain what kind of maintenance you're talking about - maybe in a separate question such as "why should I use auto-incrementing pkeys when my own system works better?" - I guess I've never run into problems like this due to my ID scheme. – alexantd Feb 22 '11 at 03:25
  • @gmale: If you've found something that works, I'm not going to tell you to stop doing it. You originally asked whether database IDs should be meaningless and you got a lot of responses. Like any of a million topics in software development, one camp will argue one thing, another camp will argue another thing, and at the end of the day, both camps get real work done. So, it sounds like your problem is solved for the time being and hopefully you've benefited from the reflection in this thread. – alexantd Feb 22 '11 at 03:35
  • @alexantd: I wanted to close this question but couldn't. I feel your answer was the best. After all this, I agree—PKs should not contain embedded info. However, in our PARTICULAR situation this doesn't hold true, mainly because we are restricted from pursuing more traditional solutions. Also, our code is excessively generic to the point where "everything is in the data." Spreadsheets are used to ease editing & keep things consistent & IDs are used everywhere as the primary things passed around in messaging. For these reasons & others it's staggeringly useful for us to use "recognizable" IDs. – gMale Mar 15 '11 at 05:55
5

Well, given your 10141 "Apple is in aisle four", what happens when you end up with product 10 in aisle 1 on shelf 41? Or is that product 1 in aisle 014 on shelf 1, or is it product 101 in aisle 41 sitting on the floor because it's not on a shelf?

Once you start co-mingling data like that, you generally lose any ability to reliably extract the components. Human-readable keys are all nice and dandy, but you never destroy the individual IDs the human form is based on.

Marc B
  • 356,200
  • 43
  • 426
  • 500
  • Yes, you've narrowed in on one of the main challenges of this approach. In our case, after spending an hour or so thinking it through, we identified conventions that fit all our data, with ample room for expansion. In that case, is it really so bad to have recognizable IDs? If they're supposed to be arbitrary, anyway, then nothing is inherently lost. The only "challenge" is maintenance--a problem easily solved by spreadsheets and formulas--not to mention also remedied by the fact that it's REALLY easy to identify incorrect IDs (additionally, most are caught by foreign key constraints). – gMale Feb 10 '11 at 02:51
4

What do you mean by "readable"? IDs are typically just numbers. And what do you mean by "consistent"? IDs are typically just incrementing numbers; you can't get much more consistent than that. Why waste time and effort trying to encode information into the ID, when the information will already be present explicitly in the database? Who would make use of "orderly" IDs?

Oliver Charlesworth
  • 267,707
  • 33
  • 569
  • 680
3

Meaningful ids are not against "db design policies"!

Quite the opposite, it's exactly what the real relational databases were about from the day one. If your data contains some combination of attributes that is - from the business point of view - unique, NOT making it an ID will usually break the Boyce-Codd normal form. And bring the anomalies that go with it.

Unless the information encoded in ID is redundant with what's in other fields, just use it. If it is redundant, make a multiple column primary key. They are not very handy with ORMs, but in data driven applications they are a blessing.

ADDENDUM: (after the edit of original question)

In your case, for a data driven application, I would do:

Type
==========
Fruit
Veggie

Product
==========
Apple    Fruit
Banana   Fruit
Orange   Fruit
Lettuce  Veggie
Onion    Veggie
Carrot   Veggie

Isle
==========
4
5

Shelf
==========
top
bottom

Location
==========
4   top
4   bottom
5   top
5   bottom

ProductLocation
==========
Apple    4  top
Banana   4  top

With such setup:

  • the data is normalized
  • you can SEE the location of any product in the ProductLocation table - you can even see the shelf
  • no surrogates
  • depending on types of queries, this structure can actually perform better than other propositions, because it requires less joins (or it may be slower, because it requires more storage).
  • this will work best with RDBMSs that support an "on replace update" constraint.
  • if you want to treat names as ids, you probably need to add some column like 'display name". That's because people want to change what's displayed more often than they want to change identity of thigns.
fdreger
  • 12,264
  • 1
  • 36
  • 42
  • +1 I like what you're saying here: don't use numeric IDs at all. Taking that approach would probably settle all the issues with our data. But... if a demand is made that numeric IDs must be used (I'm certain this will happen), then is it terribly wrong to make those IDs "recognizable" instead of arbitrary/auto-increment? Especially when, 1) our IDs never change 2) data is maintained in spreadsheets with lots of formulas that make keeping things consistent very easy. – gMale Feb 21 '11 at 15:28
  • 1
    @gmale: The problem comes from duplicating data. IDs sholud never contain data available in any other column or available by joining on a different key. It does not matter whether the duplicating column is a primary key or not or whether it's duplicated in a straightforward or complex way. – fdreger Feb 27 '11 at 20:57
3

Here is my take on Surrogate keys. (or ID's if you want to call them that)

Surrogate keys have no business meaning. They are used to uniquely identify the row. But they do more than merely identify the row. They are also the "soul" of the row. It can't be changed or traded. If the surrogate follows the "soul" principle then when you delete the row, a new row will never take the dead row's value. The soul still belongs to the deleted row even after it's dead and gone.

I like my surrogates to be "souls", though that is not required to be a surrogate.

The advantage of a surrogate is it never needs to change. If 30 other tables have a foriegn key to your main table, you don't want to update all 30 when the main table's PK changes. You can still have a CANDIDATE key on that potentially changing value, but since it can change it is not the row's soul.

Surrogate keys are often auto-increment integers. This lends itself PERFECTLY for clustered indexes. Your table joins will be as good as they can possibly be. Natural keys tend to make horrible clustered indexes as new values are rarely sequential. Integers are small, fixed length data types for even faster matching.

If your name changes, you're still you. If you burn off your finger prints, you're still you. God is using a surrogate key, so I think it's OK for us to use them in our databases.

EDIT After reading your question more carefully, I think you are actually using "meaningless keys" just in the wrong way.

You have value "10141" to represent the apple/location association. That is the combination of 2 surrogates into 1 field. Keep them as separate fields "101" and "41" and make the PK on the combo of those fields. Keeping them separate will make it easier to search, index, table join, etc.

You are right, you don't need yet another surrogate on the mapping table. The combo of 2 surrogates is a surrogate in it's own right (though not a soul). Just express the combo in 2 separate columns, not combined into 1 column. END EDIT

Lord Tydus
  • 544
  • 2
  • 7
  • @Tydus: I think you pinpoint a good solution: use a PK that is a combo of the important fields. In our case, those fields will often be foreign keys. So this still boils down to the same question: is it acceptable to intentionally make those foreign keys "recognizable?" – gMale Feb 21 '11 at 14:34
3

The three principal criteria for key design are Familiarity, Simplicity and Stability. If you use keys that are familiar and simple then users will find them easier to recognise, remember and use. They will be less likely to make mistakes when entering and using key values and data quality and usability will usually be improved.

I suggest you resolve this question by asking your users which type of identifier they find easier to use, or even testing out different schemes with them if it's very important to you. In any case, developers alone shouldn't make that decision. Some organisations have departments or individuals responsible for defining the standard coding schemes to be used.

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • this answer focuses on who uses the data. I like that—in practical terms, it's the primary thing that matters. This helps me see that this is the root of the problem. In our case, the entire issue arises because the only people to ever see or care about the IDs are the ones testing/maintaining the application. Without exception, those people find it exponentially easier to do their jobs faster when they can readily make sense of the IDs, alone, without cross-referencing tables. The issue is, making those IDs "consistent" and "readable" is clearly considered the purest form of evil. – gMale Feb 21 '11 at 14:25
2

The ID may be meaningful to you but not necessarily the computer. The database software is not going to be intelligent enough to pick up on a pattern like that (unless you program it to do so, obviously, which will be more trouble than its worth) so really all you've done is create a potential problem for yourself down the road when there's a conflict with IDs that you didn't foresee.

I understand the point that you're trying to make but good database design involves making things as easy as possible for the database engine to write and read. You're better off setting indexes and studying your database performance to find areas where you can optimize.

Jarrod Nettles
  • 6,193
  • 6
  • 28
  • 46
  • So, when you write a program, your priority is to make it easy for the COMPUTER to read? obviously, you use some kind of assembler? – fdreger Feb 09 '11 at 22:42
  • @fdreger: My priority is making my *source code* easy for me to read, and my *data* easy for my computer to process. – Oliver Charlesworth Feb 09 '11 at 22:57
  • @Oli: what if your data *is* your source code. One of the trade offs of Spring and dependency injection, in general, is that it moves lots of code into XML files and/or databases. Either way, it takes things that used to have meaning in code and moves them into data. At that point, as a developer, you will need to make sense of your data in ways never before necessary. I'm starting to understand more of the arguments against the approach I've outlined but most of it seems steeped in tradition rather than provable arguments. Java's Spring is definitely making me re-evaluate how I use data. – gMale Feb 10 '11 at 02:32
  • @fdreger Were we talking about source code? Odd, I seem to recall this being a discussion about database optimization, in which case you should make every effort to make it easier for the computer to read. – Jarrod Nettles Feb 10 '11 at 05:09
  • 1
    @Jarrod: you recall wrong, it is not about optimalization (=making data processing fast). The original question is about consistency, maintainability and readibility. Making "every effort to make it easier for the computer to read" reduces the three qualities. For example using OOP or scripting languages makes it harder for the computer and easier for people. – fdreger Feb 10 '11 at 10:01
  • @Oli: You sacrifice your data access speed for readable source code every time you use PHP. The only data structure available in PHP is a linked hash array, yet you use it in places where an ordinary linked list would suffice. – fdreger Feb 10 '11 at 10:17
1

Zooko's Triangle and the concept of petnames may be relevant here.

zwol
  • 135,547
  • 38
  • 252
  • 361
1

I thought about making this a comment but I'm afraid it might be too involved.

I think the consensus opinion is that in general IDs shouldn't have meaning. Perhaps if you were to constrain your question more to the specifics of your scenario the opinion would be different?

Based on your comments it sounded like you are doing a data load from a spreadsheet and I assume you are using the meaningful IDs as a way of determining the relationships between different data?

Is there a reason why you can't let the database handle auto-increment IDs but let the users (developers?) define codes. This way you can maintain referential integrity via foreign keys and also be properly normalized. If you really need to have some sort of quick look at the data then you could have a computed column with some sort of naming convention. It might even be more meaningful for your needs?

e.g.

Code Description
==== ===========
F    Fruit
V    Veggie

Product Code Product Type Product Description
============ ============ ===================
AP           F            Apple
BA           F            Banana

Location Code Location Description
============= ====================
AFTS          Aisle four top shelf
AFBS          Aisle four bottom shelf


Product Code Location 
============ ========
AP           AFTS 
BA           AFTS

Actually Location can be further normalized to aisles and shelves but you get the idea.

When the data is inserted into the database, IDs are created for each record and the relationships can be determined by the code and the foreign keys can be set to the proper ID. Your application can then deal with the IDs without knowledge of the codes.

So the Product Location would look something like:

Product ID Location ID
========== ===========
1          1 
2          1

If you still need something more descriptive you could just do a join in your SQL to get the codes or create a Computed Column or your app could map the IDs to the codes from a cache.

e.g.

Product ID Location ID ProductCode_LocationCode
========== =========== ========================
1          1           AP_AFTS
2          1           BA_AFTS

That would be a bit of a performance hit and I still don't really see the point but maybe that helps you?

OK, that was way too long. :)

Randy Levy
  • 22,566
  • 4
  • 68
  • 94
  • Thank you for taking time to create this response. It illustrates your point very well. The primary problem with this is that it requires that the computed column be available in all places the ID is, or else cross-referencing would be required. The reality of the problem domain is that in MANY situations, you (the developer/debugger) are stuck looking at an ID (and only an ID) and are left wondering "what the heck does this represent?" Certain IDs, naturally, start to take on meaning as they're learned. What's so evil about facilitating that recognition, if it's easy to put in place? – gMale Feb 10 '11 at 02:24
  • 1
    @gmale: I'm not sure I've ever been stuck in a situation where all I had was an ID. – NotMe Feb 11 '11 at 02:09
  • 1
    @gmale: I think I know what you mean about the annoyance of cross-referencing. It's hard to read a row with columns like foo1_id, foo2_id, foo3_id, all containing arbitrary numbers that have no meaning until looked up. 1) Even with custom-tailored IDs, you are still only getting one slice of data about that object, and you still need to look it up unless you've got it memorized. 2) Views can be extremely helpful in these situations. – alexantd Feb 11 '11 at 23:57
0

I don't think it makes much difference. I always tend to reseed my id's when I have the opportunity, but that's just me. I suppose it would be useful have some order in your id's, if you refer to them in code [enum's for example], other than that I wouldn't worry about it.

Judder1981
  • 83
  • 2
  • 2
  • 6
  • I agree. I have worked with systems that are 'designed' this way. Their a pain to work with, so keeping the id's low seems a little more manageable. – Judder1981 Feb 09 '11 at 22:49