Is using JOINs to avoid numerical IDs a bad thing?

Question

Yesterday I was looking at queries like this:

SELECT <some fields>
FROM Thing
WHERE thing_type_id = 4

... and couldn't but think this was very "readable". What's '4'? What does it mean? I did the same thing in coding languages before but now I would use constants for this, turning the 4 in a THING_TYPE_AVAILABLE or some such name. No arcane number with no meaning anymore!

I asked about this on here and got answers as to how to achieve this in SQL.

I'm mostly partial to using JOINS with existing type tables where you have an ID and a Code, with other solutions possibly of use when there are no such tables (not every database is perfect...)

SELECT thing_id
FROM Thing
JOIN ThingType USING (thing_type_id)
WHERE thing_type_code IN ('OPENED', 'ONHOLD')

So I started using this on a query or two and my colleagues were soon upon me: "hey, you have literal codes in the query!" "Um, you know, we usually go with pks for that".

While I can understand that this method is not the usual method (hey, it wasn't for me either until now), is it really so bad?

What are the pros and cons of doing things this way? My main goal was readability, but I'm worried about performance and would like to confirm whether the idea is sound or not.

EDIT: Note that I'm not talking about PL/SQL but straight-up queries, the kind that usually starts with a SELECT.

EDIT 2: To further clarify my situation with fake (but structurally similar) examples, here are the tables I have:

Thing
------------------------------------------
thing_id | <attributes...> | thing_type_id
    1                              3
    4                              7
    5                              3

ThingType
--------------------------------------------------
thing_type_id | thing_type_code | <attributes...>
       3           'TYPE_C'         
       5           'TYPE_E'
       7           'TYPE_G'

thing_type_code is just as unique as thing_type_id. It is currently also used as a display string, which is a mistake in my opinion, but would be easily fixable by adding a thing_type_label field duplicating thing_type_code for now, and changeable at any time later on if needed.

Supposedly, filtering with thing_type_code = 'TYPE_C', I'm sure to get that one line which happens to be thing_type_id = 3. Joins can (and quite probably should) still be done with the numerical IDs.

The join is using primary keys what do your colleagues mean " we usually go with pks for that" - as for performance - measure it (it will not be that much more and it is easier to show the issue to the user) — mmmmmm, Feb 20 '14 at 12:53
The other way is toy put the string codes in the Thing table but still have a ThingType table for a foreign key - then select does not need a join — mmmmmm, Feb 20 '14 at 12:55
The main reason I tend to use the string values rather than the PK (when the PK is an identity column) is that I can run the same query on the dev environment before releasing it to the production server, I can't guarantee when I add the entries to the production database that the PK will be the same. — GarethD, Feb 20 '14 at 12:56
Do the primary keys ever change? If not then why not comment the script to say clearly what 4 refers to? — Dibstar, Feb 20 '14 at 12:57
Literals like `'OPENED', 'ONHOLD'` are more reliable than pks, which could be different on different servers (eg dev vs prod). Actually, using pks as literals is BAD code. They should be meaningless ids. — Bohemian, Feb 20 '14 at 12:57
@Dibstar: this was my first thought but then I was told comments made the query hard to read. — leokhorn, Feb 20 '14 at 13:23

Bohemian · Accepted Answer · 2014-02-20T13:15:23.360

3

Primary key values should not be coded as literals in queries.

The reasons are:

Relational theory says that PKs should not convey any meaning. Not even a specific identity. They should be strictly row identifiers and not relied upon to be a specific value
Due to operational reasons, PKs are often different in different environments (like dev, qa and prod), even for "lookup" tables

For these reasons, coding literal IDs in queries is brittle.

Coding data literals like 'OPENED' and 'ONHOLD' is GOOD practice, because these values are going to be consistent across all servers and environments. If they do change, changing queries to be in sync will be part of the change script.

edited Feb 20 '14 at 13:15

answered Feb 20 '14 at 13:05

Bohemian

412,405
93
575
722

Completely agree, with the caveat that this exclusively applies to surrogate keys, natural keys do convey meaning and there is no reason not to use them in queries (depending on which side you stand of the natural vs surrogate key debate). – GarethD Feb 20 '14 at 13:19
@GarethD I know what you mean, but if you try to use a natural key with hibernate, you just want the pain to stop. – Bohemian Feb 20 '14 at 13:22
This is very reassuring as this situation made me suddenly doubt all I thought I knew about good practices, or at least that it may not apply to SQL for some reason. – leokhorn Feb 20 '14 at 18:07
@leokhorn I have been bitten by this exact problem. Correct behaviour beats performance every time. – Bohemian Feb 20 '14 at 20:11
That's what I've read these past years: code something that works correctly first, then deal with performance issues if/when they occur and only if they're in critical parts of the process. I know I've lost time trying to optimize things that had nearly no impact on the whole. – leokhorn Feb 20 '14 at 21:02
"Relational theory says that PKs should not convey any meaning" Rubbish! Sorry, but relational theory says no such thing. – nvogel Feb 21 '14 at 08:18
@sqlvogel Well, maybe not natural keys, but surrogate keys (which these IDs are) should not have any meaning. If you encode info into the key (ie the key is partitioned in some way), you break 3NF, because the non-key attributes will be dependant on *part* of the key. See [this great SO answer](http://stackoverflow.com/questions/621884/621891#621891) for more, and [this](http://rapidapplicationdevelopment.blogspot.com.au/2007/08/in-case-youre-new-to-series-ive.html) and [this](http://www.agiledata.org/essays/keys.html) for just a couple of external links also saying essentially the same thing. – Bohemian Feb 21 '14 at 13:06
@Bohemian, I don't think much of those references and I suggest you read some more reliable material. To be clear, 3NF is not violated just because a key is "meaningful"; it would be if a non-key attribute happened to be dependent on part of a key. "Natural" keys are absolutely essential to normalization, data integrity and good database design. – nvogel Feb 21 '14 at 13:28
@sqlvogel Actually, natural keys are a terrible idea and should be avoided like the plague. They are *not* essential to anything at all (especially normalization and integrity), and further constitute a failure of design. It has been many, many years since I have even seen one used. And I have worked at some very large and successful companies. If reviewing a design, I would dismiss it immediately if I saw a natural key being used. I'm not kidding either. – Bohemian Feb 21 '14 at 13:36
If I saw a table that had a surrogate key but did not have a natural key as well then I would certainly want to reject that. That seems to be a relatively rare problem these days because most database designers know better. I've wasted enough time on projects cleaning up messes of duplicate data created because some natural key was not enforced. – nvogel Feb 21 '14 at 13:44
@sqlvogel just wanted to add... Natural keys are never used commercially. The reason is there is always an edge case or a business model changeb that brings down the entire data model. Gone are the days when the database was king. Now it's the application - the dadabasd must comply with the needs of the app, not the other way around. Further, no one cares any more about RI or any of the other dara centric notions of what's "important". Look at how NoSQL is taking off. I used to be a "data guy". Now I'm more pragmatic. If the database doesn't like something, I change the database. – Bohemian Feb 21 '14 at 13:52
@sqlvogel Responding to your last comment, about your natural key - if duplicates are an issue, put a unique constraint in on the attribute. But I can practically guarantee that if you use natural data as a key, one day you'll hit an instance that doesn't have data for the attribute you're using. Then you're sunk. Use a surrogate key and no problem. Adjust your model for that attribute and keep going. – Bohemian Feb 21 '14 at 13:57
A key IS a uniqueness constraint (in SQL it must be a non-nullable one of course) and a uniqueness constraint IS the technical implementation of a key! – nvogel Feb 21 '14 at 17:00
"Hey Mr Bank Manager, what happened to my money?" "Sorry about that sir, our database developers don't believe in RI or natural keys so we duplicated your account number and paid your money to the wrong person. No-one cares about data-centric notions of what's important do they?" – nvogel Feb 21 '14 at 17:00

Gordon Linoff · Answer 2 · 2014-02-20T13:49:21.480

I assume that the question is about the two versions of the query -- one with the numeric comparison and the other with the join and string comparison.

Your colleagues are correct that the form with where thing_id in (list of ids) will perform better than the join. The difference in performance, however, might be quite minor if thing_id is not indexed. The query will already require a full table scan on the original table.

In most other respects, your version with the join is better. In particular, it makes the intent of the query cleaner and overall make the query more maintainable. For a small reference table, the performance hit may not be noticeable. In fact, in some databases, this form could be faster. This would occur when the in is evaluated as a series of or expressions. If the list is long, it might be faster to do an index lookup.

There is one downside to the join approach. If the values in the columns change, then the code also needs to be changed. I wouldn't be surprised if your colleague who suggests using primary keys has had this experience. S/he is working on an application and builds it using joins. Great. Lots of code. All clear. All maintainable. Then every week, the users decide to change the definitions of the codes. That can make almost any sane person prefer primary keys over using the reference table.

I was told of the possibility of codes changing. When I asked if it ever happened or should happen at all, I didn't get a clear-cut "yes" though. Seems like adding a Description column should deal with users wanting to change the label of each entry. — leokhorn, Feb 20 '14 at 13:35
@leokhorn . . . I totally agree. Having a short code that doesn't change and longer forms for users is the way to go. — Gordon Linoff, Feb 20 '14 at 13:48
@leokhorn - there's a related aspect to this: you want the codes used to be _language independent_, too. Depending on domain, the relevant codes may already be defined for you - say, the ISO 3-letter country codes. Also, depending on business needs, a modified code might actually be a **new** entry (which gets you in both cases). Ideally, changing one of these codes should be seen as equivalent to changing actual computer code (because it usually will - even if it's just a configuration change). Definitely use user-definable longer values when possible, and config files for the query value — Clockwork-Muse, Feb 20 '14 at 13:52

jean · Answer 3 · 2014-02-20T17:38:16.390

1

See Mark comment. I assume you are ok but can give my 2 cents on matter. If that value is in the scope of one query I like to write that this, readable, way:

declare HOLD int = 4

SELECT <some fields>
FROM Thing
WHERE thing_type_id = HOLD

If that values are used many times in many points (queries, SP, views, etc) I create a domain table.

create table ThingType (id int not null primary key, varchar(50) description)
GO
insert into ThingType values (4,'HOLD'),(5, 'ONHOLD')
GO

that way i can reuse that types on my selects as an enumerator

declare TYPE int
set TYPE = (select id from ThingType where description = 'HOLD')

SELECT <some fields>
FROM Thing
WHERE thing_type_id = TYPE

that way I keep meaning and performance (and also can enforce relational integrity over domain values)

Also I can just use enumerator at app level and just pass numeric values to the queries. A quick glimpse in that enumerator ill give me that number meaning.

edited Feb 20 '14 at 17:38

answered Feb 20 '14 at 13:06

jean

4,159
4
31
52

Is this PLSQL? I'm getting errors on the "@" sign – leokhorn Feb 20 '14 at 13:30
Sorry about the @ I missed de Oracle tag and did it with T-SQL instead of PL/SQL. That's just a variable declaration. – jean Feb 20 '14 at 17:37
Thing is, unless PL/SQL can be integrated into a simple query and doesn't require to be a stored procedure/function on the database, I'd rather stick to regular SQL. From my research, I'm guessing variables don't really exist in base SQL though... – leokhorn Feb 20 '14 at 18:05
by "base" and "regular" sql do you mean ANSI sql? – jean Feb 20 '14 at 18:59
Not a 100% sure, but I think so. SQL without any procedural ability. No functions, flow-control... just selecting data from source, aggregating and filtering. As far as I can tell, the kind of SQL code you put in a query and not in a procedure. That said, I may be unaware that PL/SQL can be included in a query. I'm new to Oracle and stored procedures, though have experience in PostgreSQL. – leokhorn Feb 20 '14 at 20:58
Unfortunately ANSI SQL is not a weapon of choice. If you stick to it you ill be limited in what you actually can do and how you can do and worse: in the best way you can do. – jean Feb 20 '14 at 21:07

score 0 · Answer 4 · answered Feb 20 '14 at 12:56

In SQL queries you will definitely introduce a performance hit for JOINs (effectively multiple queries are taking place inside the SQL server). The question is whether the performance hit is significant enough to offset the benefits.

If it's just a readability thing then you may prefer to go for better performance and avoid the JOINs, but I would suggest you take into account potential integrity problems (e.g. what happens if the typed value of 4 in your example is changed by another process further down the line - the entire application may fail).

If the values will NEVER change then use PKs - this is a decision for you as the developer - there is no rule. One options may be best for one query and not for another.

score 0 · Answer 5 · answered Feb 20 '14 at 15:02

In case of PL/SQL it makes sense to define constants in your package, e.g.

DECLARE
   C_OPENED CONSTANT NUMBER := 3;
   C_ONHOLD CONSTANT NUMBER := 4;
BEGIN
SELECT <some fields>
INTO ...
FROM Thing
WHERE thing_type_id in (C_OPENED, C_ONHOLD);
END;

Sometime it is usefull to create global package (without a body) where all commonly used constants are defined. In case the literal changes, you only have to modify the constant definition at a single place.

Is using JOINs to avoid numerical IDs a bad thing?

5 Answers5