
Here is what I am trying to figure out: our new client management system needs a table to store authorizations, and each authorization has a unique identifier. This constraint would be easy to translate to SQL, but unfortunately, because of the slowness of bureaucracy, we sometimes need to create an entry with a placeholder ID (e.g., "temp") so that the client can start taking services.
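To make the conflict concrete, here is a minimal sketch of the problem (table and column names are made up for illustration) with a plain UNIQUE constraint:

```sql
-- Hypothetical table; a plain UNIQUE constraint rejects the second placeholder.
CREATE TABLE authorizations (
    auth_id text NOT NULL UNIQUE,  -- e.g. 'NMED012345678'
    client  text NOT NULL
);

INSERT INTO authorizations VALUES ('temp', 'first client');   -- OK
INSERT INTO authorizations VALUES ('temp', 'second client');  -- ERROR: duplicate key
```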

What would be the best practice to enforce this conditional uniqueness constraint?

These are the options I could come up with, given my limited experience:

  • Use partial indexing, as mentioned in the PostgreSQL manual (5.3.3. -> Example 11-3.), which also notes that "this is a particularly efficient approach when there are few successful tests and many unsuccessful ones." In our legacy DB, which will be migrated, there are 130,000 rows and about 5 temp authorizations a month, but the whole table only grows by about 200 rows per year. Would this be the right approach? (I am also not sure what "efficient" means in this context.)
  • Create a separate table for the temp authorizations, but that would duplicate the table structure.
  • Define a unique constraint on a group of columns: an authorization is issued to an individual for a specific service for a certain time period.

EDIT:

I'm sorry, I think my description of the authorization ID was a bit vague: it is provided by a state department in the format NMED012345678, and it is entered by hand. It is unique, but sometimes it is only provided at a later time, for unknown reasons.

toraritte
    Can you provide sample data ? – Gordon Linoff Jul 01 '17 at 12:27
  • If `temp` is your only non-unique repeat value, I'd go for the `partial` `unique index`. – joanolo Jul 01 '17 at 12:46
  • Can you not use unique "temp" values such as "temp01", "temp02"? Using the creation date as part of the identifier would minimize the need to manage duplicates among these. – rd_nielsen Jul 01 '17 at 12:52
  • Much depends on the nature and exact requirements of / for your ID. Can it change? Is it numeric? Does it carry a meaning or can it be a plain serial column? – Erwin Brandstetter Jul 01 '17 at 14:24
  • Have more than one flag that means "temp", and make them unique. For example, if the ID is between -1 and -999, then it's a temp flag. Now you can still have a unique index with no problems, unless you get a thousand temps. Depending on your workflow, you may have to allow temp ids to mutate into permanent IDs, and to cascade to all foreign keys that reference them. – Walter Mitty Jul 01 '17 at 14:34
  • Thanks to @wildplasser's answer below, I found this short paper that also perfectly describes my issue. Documenting it here in case it helps someone else as well: https://www.slideshare.net/ronmorg1/natural-vssurrogate-keys – toraritte Jul 03 '17 at 11:13
  • [Why is asking a question on "best practice" a bad thing?](https://meta.stackexchange.com/q/142353/266284) [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) Please don't insert "EDIT"s/"UPDATE"s, just make your post the best presentation as of edit time. Adding to something unclear doesn't make it clear. [ask] [Help] [mre] PS Comments are ephemeral & for post issues. Please, if you want something relevant to hang around, edit it into your post. – philipxy Jul 18 '22 at 20:05

4 Answers


There is a simple, fast and secure way:

Add a boolean column to mark temporary entries which is NULL by default, say:

temp bool DEFAULT NULL CHECK (temp)

The added check constraint disallows FALSE; only NULL or TRUE are possible. Storage cost for the default NULL value is typically ... nothing - unless there are no other NULL values in the row.

The column default means you don't normally have to take care of the column. It's NULL by default (which is the default default anyway; I'm just being explicit here). You only need to mark the few exceptions explicitly.

Then create a partial unique index like:

CREATE UNIQUE INDEX tbl_unique_id_uni ON tbl (unique_id) WHERE temp IS NULL;

That only includes rows supposed to be unique. Index size is not increased at all. Be sure to add the predicate WHERE temp IS NULL to queries that are supposed to use the unique index.
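Putting the column and the index together, a minimal sketch (assuming a table tbl with a text column unique_id, matching the names above) behaves like this:

```sql
CREATE TABLE tbl (
    unique_id text NOT NULL,
    temp      bool DEFAULT NULL CHECK (temp)  -- NULL = regular row, TRUE = placeholder
);

CREATE UNIQUE INDEX tbl_unique_id_uni ON tbl (unique_id) WHERE temp IS NULL;

INSERT INTO tbl (unique_id) VALUES ('NMED012345678');              -- OK
INSERT INTO tbl (unique_id, temp) VALUES ('NMED012345678', true);  -- OK: temp rows are not in the index
INSERT INTO tbl (unique_id) VALUES ('NMED012345678');              -- ERROR: duplicate key
```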


Erwin Brandstetter
  • Wow, so simple. I also just realized that this is basically the textbook example on partial indexing from the Postgres manual, with the same caveat... https://www.postgresql.org/docs/9.6/static/indexes-partial.html#INDEXES-PARTIAL-EX2 -> `the indexed column and the column used in the predicate do not need to match. (...) keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index.` Thanks! – toraritte Jul 02 '17 at 00:28

There are several possibilities:

  1. Make the temp identifiers unique; for instance, if they are automatically created (not entered manually), generate them from a sequence:

    CREATE SEQUENCE temp_ids_seq;  -- done only once for the database
    

    Whenever you need a new temporary id, issue

    'temp' || nextval('temp_ids_seq') AS id
    
  2. Use a partial unique index, assuming that 'temp' is the only value that is allowed to repeat:

    CREATE UNIQUE INDEX tbl_unique_idx ON tbl (id) WHERE (id IS DISTINCT FROM 'temp')
    

    For the sake of efficiency, you would probably also like to have the complementary index in those cases:

    CREATE INDEX tbl_temp_idx ON tbl (id) WHERE (id IS NOT DISTINCT FROM 'temp')
    

    This last index will help queries seeking id = 'temp'.
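As a sketch of how option 2 behaves (assuming a table tbl with a text column id):

```sql
CREATE TABLE tbl (id text NOT NULL);

CREATE UNIQUE INDEX tbl_unique_idx ON tbl (id) WHERE (id IS DISTINCT FROM 'temp');
CREATE INDEX tbl_temp_idx ON tbl (id) WHERE (id IS NOT DISTINCT FROM 'temp');

INSERT INTO tbl VALUES ('temp');           -- OK
INSERT INTO tbl VALUES ('temp');           -- OK: 'temp' is excluded from the unique index
INSERT INTO tbl VALUES ('NMED012345678');  -- OK
INSERT INTO tbl VALUES ('NMED012345678');  -- ERROR: duplicate key
```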

joanolo
  • (1) wouldn't work as this column has to be entered manually, but I failed to provide this information in my question - thanks for the sequence example though, I just looked it up and learned something new. (2) would work because even though multiple people use their own conventions in our dept, the `'temp'` single value could be enforced at the application level. I chose @Erwin's answer because of its simplicity. Thanks again! – toraritte Jul 02 '17 at 00:36

This is a bit long for a comment.

I think I would have an authorization table with a unique authorization ID. The authorization could then have two types, "approved" and "temporary". You could handle this with two columns.

However, I would probably make the authorization ID a serial column, with the "approved" ID being a field in the table. That field could have a unique constraint on it. You can use either a full unique constraint or a partial unique index that filters out NULLs (Postgres allows multiple NULL values under a unique constraint, but the second option is more explicit).

You can have the same process for the temporary authorizations -- using a different column. Presumably you have some mechanism for authorizing them and storing the approval date, time, and person.

I would not use two tables. Having authorizations spread among multiple tables just seems likely to sow confusion. Anywhere in the code where you want to see who has an authorization becomes an opportunity to misread the data.
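A sketch of this two-column design (all names here are my guesses at what is being described) could look like:

```sql
CREATE TABLE authorizations (
    authorization_id serial PRIMARY KEY,  -- internal ID, always present and unique
    approved_id      text UNIQUE,         -- state-issued ID; NULL while only temporary
    approved_at      timestamptz,
    approved_by      text
);
-- Postgres treats NULLs as distinct in a unique constraint, so any number of
-- pending rows can coexist, while duplicate approved_id values are rejected.
```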

Gordon Linoff
  • I am still trying to understand your suggestion, but it seems like @Erwin's answer is a generalization of your idea. Then again, I failed to mention that the authorization ID is manually entered. Is the `field` that you emphasized the same as a regular column, or am I missing something? Thanks for the reassurance, too, that the two-table idea would have been terrible; I didn't like it either. – toraritte Jul 02 '17 at 00:56
  • @toraritte . . . Erwin does a very good job explaining things in his answers. – Gordon Linoff Jul 02 '17 at 01:06

IMO it is not advisable to use remote keys as (part of) primary keys.

  • they are not under your control; they can change
  • you cannot guarantee their correctness and/or uniqueness (email addresses, telephone numbers, licence numbers, serial numbers)
  • using them as a PK would cause them to be used as FKs from other tables into this table, with fat indexes and lots of cascading on change.


CREATE TABLE the_persons
        ( seq SERIAL NOT NULL PRIMARY KEY -- surrogate key
        , registrationnumber varchar -- "remote" KEY, not necessarily UNIQUE
        , is_validated BOOLEAN NOT NULL DEFAULT FALSE
        , last_name varchar
        , dob DATE
        );

CREATE INDEX name_dob_idx ON the_persons(last_name, dob)
        ;

CREATE UNIQUE INDEX registrationnumber_idx ON the_persons(registrationnumber,seq)
-- WHERE is_validated = False
        ;

CREATE UNIQUE INDEX registrationnumber_key ON the_persons(registrationnumber)
WHERE is_validated = True
        ;

INSERT INTO the_persons(is_validated,registrationnumber,last_name, dob)VALUES
 ( True, 'OKAY001', 'Smith', '1988-02-02')
,( True, 'OKAY002', 'Jones', '1988-02-02')
,( False, 'OKAY001', 'Smith', '1988-02-02')
,( False, 'OMG001', 'Smith', '1988-08-02')
        ;

-- validated records:
SELECT *
FROM the_persons
WHERE is_validated = True
        ;

-- some records with nasty cousins
SELECT *
FROM the_persons p
WHERE EXISTS (
        SELECT *
        FROM the_persons x
        WHERE x.registrationnumber = p.registrationnumber
        AND x.is_validated = False
        )
AND last_name LIKE 'Smith%'
        ;
wildplasser
  • Thanks for the explanation and especially for the example! Using the authorization numbers as a PK didn't feel right, but your summary made this hunch explicit. I am still trying to figure out why your last query returned all 3 Smith rows - even after I removed the last ANDing - but I think I know what point you were trying to make. (I had to look up the term `surrogate key` and now I am an hour into the surrogate vs. natural key rabbit hole... :) – toraritte Jul 03 '17 at 10:51
  • The last query is a bit of a dirty trick: the exists() yields true for the unvalidated record *itself*, too (and for the ones with nasty cousins). WRT the theory: this method very much resembles the way *data-vault* manages *remote* keys. – wildplasser Jul 03 '17 at 18:02