
We have a legacy database schema that has some interesting design decisions. Until recently, we have only supported Oracle and SQL Server, but we are trying to add support for PostgreSQL, which has brought up an interesting problem. I have searched Stack Overflow and the rest of the internet and I don't believe this particular situation is a duplicate.

Oracle and SQL Server both behave the same when it comes to nullable columns in a unique constraint, which is to essentially ignore the columns that are NULL when performing the unique check.

Let's say I have the following table and constraint:

CREATE TABLE EXAMPLE
(
    ID TEXT NOT NULL PRIMARY KEY,
    FIELD1 TEXT NULL,
    FIELD2 TEXT NULL,
    FIELD3 TEXT NULL,
    FIELD4 TEXT NULL,
    FIELD5 TEXT NULL,
    ...
);

CREATE UNIQUE INDEX EXAMPLE_INDEX ON EXAMPLE
(
    FIELD1 ASC,
    FIELD2 ASC,
    FIELD3 ASC,
    FIELD4 ASC,
    FIELD5 ASC
);

On both Oracle and SQL Server, leaving any of the nullable columns NULL will result in only performing a uniqueness check on the non-null columns. So the following inserts can only be done once:

INSERT INTO EXAMPLE VALUES ('1','FIELD1_DATA', NULL, NULL, NULL, NULL );
INSERT INTO EXAMPLE VALUES ('2','FIELD1_DATA','FIELD2_DATA', NULL, NULL,'FIELD5_DATA');
-- On PostgreSQL, these succeed even though Oracle and SQL Server reject them:
INSERT INTO EXAMPLE VALUES ('3','FIELD1_DATA', NULL, NULL, NULL, NULL );
INSERT INTO EXAMPLE VALUES ('4','FIELD1_DATA','FIELD2_DATA', NULL, NULL,'FIELD5_DATA');

However, because PostgreSQL (correctly) adheres to the SQL standard, those inserts (and any other combination of values, as long as at least one column is NULL) succeed without error. Unfortunately, because of our legacy schema and the supporting code, we need PostgreSQL to behave the same as SQL Server and Oracle.

I am aware of the following Stack Overflow question and its answers: Create unique constraint with null columns. From my understanding, there are two strategies to solve this problem:

  1. Create partial indexes that describe the index in cases where the nullable columns are both NULL and NOT NULL (which results in exponential growth of the number of partial indexes)
  2. Use COALESCE with a sentinel value on the nullable columns in the index.

The problem with (1) is that the number of partial indexes grows exponentially with each additional nullable column in the constraint (2^N, if I am not mistaken). The problems with (2) are that the sentinel value removes one otherwise-legal value from each column's domain, plus the potential performance cost of an expression index.
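To make the growth in (1) concrete: even with only two nullable columns, four partial indexes are needed to cover every NULL/NOT NULL combination (a sketch; the index names are made up, and the all-NULL case needs an expression index over a constant):

```sql
CREATE UNIQUE INDEX ex_both ON example (field1, field2)
    WHERE field1 IS NOT NULL AND field2 IS NOT NULL;
CREATE UNIQUE INDEX ex_f1 ON example (field1)
    WHERE field1 IS NOT NULL AND field2 IS NULL;
CREATE UNIQUE INDEX ex_f2 ON example (field2)
    WHERE field1 IS NULL AND field2 IS NOT NULL;
-- Allow at most one row where both columns are NULL:
CREATE UNIQUE INDEX ex_none ON example ((TRUE))
    WHERE field1 IS NULL AND field2 IS NULL;
```

With the five columns from the example that is 2^5 = 32 indexes, and with 10 columns it is 1024.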

My question: are these the only two solutions to this problem? If so, what are the tradeoffs between them for this particular use case? A good answer would discuss the performance of each solution, the maintainability, how PostgreSQL would utilize these indexes in simple SELECT statements, and any other "gotchas" or things to be aware of. Keep in mind that 5 nullable columns was only an example; we have tables in our schema with up to 10 (yes, I cry every time I see it, but it is what it is).

John Drouhard
    There's a third option. Instead of actually inserting the "sentinel value", you could use a function-based index: `CREATE UNIQUE INDEX u ON example(COALESCE(field1, '~~~~NULL'), COALESCE(field2, '~~~~NULL'))`. It'll be worse, but it's worth considering – Lukas Eder May 10 '15 at 19:47
  • Yeah, that's actually what I meant by option number two. Use `COALESCE` as a functional index. Seems like the performance in that case would be bad and the sentinel value used couldn't actually be inserted if another row was `NULL` for that column already. – John Drouhard May 10 '15 at 19:51
  • OK, I see. Well, then option 3 is to actually insert the sentinel value (e.g. via a trigger or via a converter in your accessing application). That would be fast, but it would be ugly.... by the way, you have another compatibility issue. In Oracle, `'' IS NULL` is true :) – Lukas Eder May 10 '15 at 19:53
  • Maybe there is something wrong with your example (except for the missing semicolons and the `"`-quoting of string constants). If I submit the above inserts twice in PG, the second pair fails. (what should happen instead?) – wildplasser May 10 '15 at 21:52
  • I'll edit the question, but do the second insert with a different value for the key. It'll succeed when it shouldn't. – John Drouhard May 10 '15 at 22:01
  • Sorry, I was confused by the `text` primary key. – wildplasser May 10 '15 at 22:39

4 Answers


Postgres 15 adds the clause NULLS NOT DISTINCT

The solution is very simple now:

ALTER TABLE example ADD CONSTRAINT foo
UNIQUE NULLS NOT DISTINCT (field1, field2, field3, field4, field5);
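With that constraint in place, re-running the offending inserts from the question now raises a violation (a sketch; only the first insert of the pair succeeds):

```sql
INSERT INTO example VALUES ('1', 'FIELD1_DATA', NULL, NULL, NULL, NULL);
-- Fails now: the NULLs in the key columns are treated as not distinct:
INSERT INTO example VALUES ('3', 'FIELD1_DATA', NULL, NULL, NULL, NULL);
```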

For Postgres 14 or older

You are striving for compatibility with your existing Oracle and SQL Server implementations.

Since Oracle treats an empty string in a VARCHAR2 column as NULL, it can't tell the difference between '' and NULL anyway. So wouldn't it be prudent to use empty strings ('') instead of NULL values in Postgres as well - for this particular use case?

Define columns included in the unique constraint as NOT NULL DEFAULT '', problem solved:

CREATE TABLE example (
   example_id serial PRIMARY KEY
 , field1 text NOT NULL DEFAULT ''
 , field2 text NOT NULL DEFAULT ''
 , field3 text NOT NULL DEFAULT ''
 , field4 text NOT NULL DEFAULT ''
 , field5 text NOT NULL DEFAULT ''
 , CONSTRAINT foo UNIQUE (field1, field2, field3, field4, field5)
);

Notes

What you demonstrate in the question is a unique index:

CREATE UNIQUE INDEX ...

Not the unique constraint you keep talking about. There are subtle, important differences!

I changed that to an actual constraint like in the title of the question.

The keyword ASC is just noise, since that is the default sort order. I dropped it.

Using a serial PK column for simplicity; it is totally optional, but typically preferable to numbers stored as text.

Working with it

Just omit empty / null fields from the INSERT:

INSERT INTO example(field1) VALUES ('F1_DATA');
INSERT INTO example(field1, field2, field5) VALUES ('F1_DATA', 'F2_DATA', 'F5_DATA');

Repeating any of these inserts would violate the unique constraint.

Or, if you insist on omitting the target-column list (a bit of an anti-pattern in persisted INSERT statements), or for bulk inserts where all columns need values:

INSERT INTO example VALUES
  ('1', 'F1_DATA', DEFAULT, DEFAULT, DEFAULT, DEFAULT)
, ('2', 'F1_DATA','F2_DATA', DEFAULT, DEFAULT,'F5_DATA')
;

Or simply:

INSERT INTO example VALUES
  ('1', 'F1_DATA', '', '', '', '')
, ('2', 'F1_DATA','F2_DATA', '', '','F5_DATA')
;

Or you can write a trigger BEFORE INSERT OR UPDATE that converts NULL to ''.
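A sketch of such a trigger (the function and trigger names are made up):

```sql
-- Coerce NULL to '' in the key columns before the row is stored,
-- so the plain unique constraint on NOT NULL columns applies.
CREATE FUNCTION example_blank_nulls() RETURNS trigger AS $func$
BEGIN
   NEW.field1 := COALESCE(NEW.field1, '');
   NEW.field2 := COALESCE(NEW.field2, '');
   NEW.field3 := COALESCE(NEW.field3, '');
   NEW.field4 := COALESCE(NEW.field4, '');
   NEW.field5 := COALESCE(NEW.field5, '');
   RETURN NEW;
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER example_blank_nulls
BEFORE INSERT OR UPDATE ON example
FOR EACH ROW EXECUTE PROCEDURE example_blank_nulls();
```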

Alternative solutions

If you need to use actual NULL values I would suggest the unique index with COALESCE like you mentioned as option (2) and @wildplasser provided as his last example.

The index on an array like @Rudolfo presented is simple, but considerably more expensive. Array handling isn't very cheap in Postgres and there is an array overhead similar to that of a row (24 bytes).

Arrays are limited to columns of the same data type. You could cast all columns to text if some are not, but it will typically further increase storage requirements. Or you could use a well-known row type for heterogeneous data types ...

A corner case: array (or row) types with all NULL values are considered equal (!), so there can only be 1 row with all involved columns NULL. May or may not be as desired. If you want to disallow all columns NULL, add a CHECK constraint.
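A sketch of such a check against the question's five columns (the constraint name is made up):

```sql
-- Reject rows where every key column is NULL:
ALTER TABLE example ADD CONSTRAINT example_not_all_null CHECK (
    NOT (field1 IS NULL AND field2 IS NULL AND field3 IS NULL
         AND field4 IS NULL AND field5 IS NULL)
);
```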

Erwin Brandstetter
  • We use ODBC to interact with our databases, and we bind NULL when the string is empty. But because Oracle already treats them the same, we basically went with my original COALESCE option with a sentinel value of ''. For our int columns, we are rewriting our business logic to allow ourselves to remove nullable int columns from unique indexes entirely by making them not nullable. Thanks for the answer! – John Drouhard May 22 '15 at 01:16

Third method: use IS NOT DISTINCT FROM instead of = for comparing the key columns. (This could make use of the existing index on the candidate natural key.) Example (look at the last column):

SELECT *
    , EXISTS (SELECT * FROM example x
     WHERE x.FIELD1 IS NOT DISTINCT FROM e.FIELD1
     AND x.FIELD2 IS NOT DISTINCT FROM e.FIELD2
     AND x.FIELD3 IS NOT DISTINCT FROM e.FIELD3
     AND x.FIELD4 IS NOT DISTINCT FROM e.FIELD4
     AND x.FIELD5 IS NOT DISTINCT FROM e.FIELD5
     AND x.ID <> e.ID
    ) other_exists
FROM example e
    ;

Next step would be to put that into a trigger function, and put a trigger on it. (don't have the time now, maybe later)


And here is the trigger function (not perfect yet, but it appears to work):


CREATE FUNCTION example_check() RETURNS trigger AS $func$
BEGIN
    -- Check for another row with the same combination of key values (NULL-aware)
    IF EXISTS (
     SELECT 666 FROM example x
     WHERE x.FIELD1 IS NOT DISTINCT FROM NEW.FIELD1
     AND x.FIELD2 IS NOT DISTINCT FROM NEW.FIELD2
     AND x.FIELD3 IS NOT DISTINCT FROM NEW.FIELD3
     AND x.FIELD4 IS NOT DISTINCT FROM NEW.FIELD4
     AND x.FIELD5 IS NOT DISTINCT FROM NEW.FIELD5
     AND x.ID <> NEW.ID
            ) THEN
        RAISE EXCEPTION 'MultiLul BV';
    END IF;


    RETURN NEW;
END;
$func$ LANGUAGE plpgsql;

CREATE TRIGGER example_check BEFORE INSERT OR UPDATE ON example
  FOR EACH ROW EXECUTE PROCEDURE example_check();
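One way to mitigate the race condition @zerkms points out below is to take a transaction-level advisory lock on the key inside the trigger, so that two sessions inserting the same key serialize. This is only a sketch under READ COMMITTED; the use of hashtext() with concat_ws() here is an illustration, not a recommendation (hash collisions merely over-serialize, which is safe but slower):

```sql
-- Add at the top of example_check(), before the EXISTS test.
-- Concurrent writers of the same key now block each other until commit,
-- at which point the EXISTS check sees the committed row.
PERFORM pg_advisory_xact_lock(
    hashtext(concat_ws('|', NEW.field1, NEW.field2,
                            NEW.field3, NEW.field4, NEW.field5))
);
```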

UPDATE: a unique index can sometimes be wrapped into a constraint (see the Postgres 9.4 docs, final example). You do need to invent a sentinel value; I used the empty string '' here.


CREATE UNIQUE INDEX ex_12345 ON example
        (coalesce(FIELD1, '')
        , coalesce(FIELD2, '')
        , coalesce(FIELD3, '')
        , coalesce(FIELD4, '')
        , coalesce(FIELD5, '')
        )
        ;

ALTER TABLE example
        ADD CONSTRAINT con_ex_12345
        UNIQUE USING INDEX ex_12345;

But the "functional" index on coalesce() is not allowed in this construct. The unique index (OP's option 2) still works, though:


ERROR:  index "ex_12345" contains expressions
LINE 2:  ADD CONSTRAINT con_ex_12345
             ^
DETAIL:  Cannot create a primary key or unique constraint using such an index.
INSERT 0 1
INSERT 0 1
ERROR:  duplicate key value violates unique constraint "ex_12345"
wildplasser
  • Is your trigger implementation concurrent-safe? I think you're prone to a race condition issue here. – zerkms May 10 '15 at 23:33
  • I suppose it is. The trigger is part of the current transaction, so it should be just as concurrent-safe as the insert/update itself. (maybe it could be hooked up via (a) check constraint(s)) Should it be an `AFTER` trigger instead? – wildplasser May 10 '15 at 23:36
  • Well, from what I see you can concurrently insert 2 rows that would violate this trigger: the checks would pass successfully simultaneously. "Should it be an AFTER trigger instead? " --- I'm sure you cannot provide a proper solution using triggers at all (unless you use explicit locking to manage concurrent access) – zerkms May 10 '15 at 23:40
  • Unfortunately I need the behavior to be concurrent-safe. Otherwise, I appreciate the clever approach to the problem. Would adding explicit locking like @zerkms suggests kill the performance? If it's feasible, an example would be great. – John Drouhard May 11 '15 at 15:09
  • @JohnDrouhard it depends on the throughput – zerkms May 11 '15 at 21:17
  • Yeah, for now, we are essentially using the `UNIQUE INDEX` exactly how you have in your updated example, without adding the actual `CONSTRAINT`. Does explicitly adding the constraint change the behavior at all? – John Drouhard May 14 '15 at 15:50
  • No, the embedding of the index into a constraint fails (see the error message) But the index still exists (and prevents multiple PK=NULL entries) But it is the same as your solution#2. BTW: other solutions might exist, such as maintaining a key-only version of the PK in a shadow table via triggers, but in fact this shadow-table would be equivalent to the COALESCED 5-key-index. – wildplasser May 14 '15 at 16:17

This actually worked well for me:

CREATE UNIQUE INDEX index_name ON table_name ((
   ARRAY[field1, field2, field3, field4]
));

I don't know how performance is affected, but it should be close to ideal (depending on how well arrays are optimized in Postgres).
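Applied to the question's table, that would look like this (a sketch; note that NULL elements compare as equal inside an array, which is exactly the behavior the question asks for):

```sql
CREATE UNIQUE INDEX example_arr_idx ON example ((
   ARRAY[field1, field2, field3, field4, field5]
));

INSERT INTO example VALUES ('1', 'FIELD1_DATA', NULL, NULL, NULL, NULL);
-- Fails: same array value, NULL elements included:
INSERT INTO example VALUES ('3', 'FIELD1_DATA', NULL, NULL, NULL, NULL);
```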


You can create a rule that redirects rows containing NULL values away from the original table into partitions like partition_field1_nullable, partition_field2_nullable, etc. That way you create a unique index on the original table only (with no NULLs). Only fully non-null rows land in the original table (preserving uniqueness), while any number of null-containing (and therefore non-unique) rows go into the "nullable partitions". You can then apply the COALESCE or trigger method against the nullable partitions only, avoiding many scattered partial indexes and a trigger firing on every DML statement against the original table.

Vao Tsun
  • @JohnDrouhard well, idea was to create partitions - one for not nullable data, one for nullable. then create a unique index on not nullable. and function based index for nullable as a_horse_with_no_name offered. this way and DML and selects would run faster – Vao Tsun May 13 '15 at 09:10