
I'm trying to come up with a PostgreSQL schema for host data that's currently in an LDAP store. Part of that data is the list of hostnames a machine can have, and that attribute is generally the key that most people use to find the host records.

One thing I'd like to get out of moving this data to an RDBMS is the ability to set a uniqueness constraint on the hostname column so that duplicate hostnames can't be assigned. This would be easy if hosts could only have one name, but since they can have more than one it's more complicated.

I realize that the fully-normalized way to do this would be to have a hostnames table with a foreign key pointing back to the hosts table, but I'd like to avoid having everybody need to do joins for even the simplest query:

select hostnames.name,hosts.*
  from hostnames,hosts
 where hostnames.name = 'foobar'
   and hostnames.host_id = hosts.id;

I figured using PostgreSQL arrays could work for this, and they certainly make the simple queries simple:

select * from hosts where names @> '{foobar}';

When I set a uniqueness constraint on the hostnames attribute, though, it of course treats the entire list of names as the unique value instead of each name. Is there a way to make each name unique across every row instead?

If not, does anyone know of another data-modeling approach that would make more sense?

Lars Damerow

2 Answers


The righteous path

You might want to reconsider normalizing your schema. It is not necessary for everyone to "join for even the simplest query". Create a VIEW for that.

The table could look like this:

CREATE TABLE hostname (
  hostname_id serial PRIMARY KEY
, host_id     int  REFERENCES host(host_id) ON UPDATE CASCADE ON DELETE CASCADE
, hostname    text UNIQUE
);

The surrogate primary key hostname_id is optional. I prefer to have one. In your case hostname could be the primary key. But many operations are faster with a simple, small integer key. Create a foreign key constraint to link to the table host.
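For reference, a minimal sketch of the host table that the foreign key points to (only host_id matters here; any other columns are placeholders):

CREATE TABLE host (
  host_id serial PRIMARY KEY
, descr   text   -- placeholder for the remaining host attributes
);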
Create a view like this:

CREATE VIEW v_host AS
SELECT h.*
     , array_agg(hn.hostname) AS hostnames
--   , string_agg(hn.hostname, ', ') AS hostnames  -- text instead of array
FROM   host h
JOIN   hostname hn USING (host_id)
GROUP  BY h.host_id;   -- works in v9.1+

Starting with Postgres 9.1, the primary key in the GROUP BY covers all columns of that table in the SELECT list. From the release notes for version 9.1:

Allow non-GROUP BY columns in the query target list when the primary key is specified in the GROUP BY clause

Queries can use the view like a table. Searching for a hostname will be much faster this way:

SELECT *
FROM   host h
JOIN   hostname hn USING (host_id)
WHERE  hn.hostname = 'foobar';

That is, provided you have an index on host(host_id), which is a given if host_id is the primary key. The UNIQUE constraint on hostname(hostname) provides the other needed index automatically.
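The view can also be filtered directly; a quick sketch (note that this form works on the aggregated arrays and cannot use the index on hostname, so prefer the join above when speed matters):

SELECT *
FROM   v_host
WHERE  'foobar' = ANY (hostnames);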

In Postgres 9.2+ a multicolumn index would be even better if you can get an index-only scan out of it:

CREATE INDEX hn_multi_idx ON hostname (hostname, host_id);

Starting with Postgres 9.3, you could use a MATERIALIZED VIEW, circumstances permitting, especially if you read much more often than you write to the table.
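A minimal sketch of that variant, reusing the same query as the view above (the name mv_host is my choice):

CREATE MATERIALIZED VIEW mv_host AS
SELECT h.*
     , array_agg(hn.hostname) AS hostnames
FROM   host h
JOIN   hostname hn USING (host_id)
GROUP  BY h.host_id;

-- the snapshot is static; refresh it after relevant changes
REFRESH MATERIALIZED VIEW mv_host;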

The dark side (what you actually asked)

If I can't convince you of the righteous path, here is some assistance for the dark side:

Here is a demo of how to enforce uniqueness of hostnames. I use a table hostname to collect hostnames and a trigger on the table host to keep it up to date. Unique violations raise an exception and abort the operation.

CREATE TABLE host(hostnames text[]);
CREATE TABLE hostname(hostname text PRIMARY KEY);  --  pk enforces uniqueness

Trigger function:

CREATE OR REPLACE FUNCTION trg_host_insupdelbef()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   -- split UPDATE into DELETE & INSERT
   IF TG_OP = 'UPDATE' THEN
      IF OLD.hostnames IS NOT DISTINCT FROM NEW.hostnames THEN
         RETURN NEW;  -- exit, nothing to do
      END IF;
   END IF;

   IF TG_OP IN ('DELETE', 'UPDATE') THEN
      DELETE FROM hostname h
      USING  unnest(OLD.hostnames) d(x)
      WHERE  h.hostname = d.x;

      IF TG_OP = 'DELETE' THEN RETURN OLD;  -- exit, we are done
      END IF;
   END IF;

   -- control only reaches here for INSERT or UPDATE (with actual changes)
   INSERT INTO hostname(hostname)
   SELECT h
   FROM   unnest(NEW.hostnames) h;

   RETURN NEW;
END
$func$;

Trigger:

CREATE TRIGGER host_insupdelbef
BEFORE INSERT OR DELETE OR UPDATE OF hostnames ON host
FOR EACH ROW EXECUTE PROCEDURE trg_host_insupdelbef();  -- EXECUTE FUNCTION in Postgres 11+

SQL Fiddle with test run.
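A quick sketch of such a test run (the sample hostnames are mine); the second INSERT aborts because the trigger hits the primary key on hostname:

INSERT INTO host(hostnames) VALUES ('{db1,web1}');    -- OK
INSERT INTO host(hostnames) VALUES ('{web1,mail1}');  -- aborts with:
-- ERROR:  duplicate key value violates unique constraint "hostname_pkey"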

Use a GIN index on the array column host.hostnames and array operators to work with it.
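A minimal sketch of that combination (the index name is my choice; the query is the one from the question):

CREATE INDEX host_hostnames_gin_idx ON host USING gin (hostnames);

-- "contains" query; can use the GIN index
SELECT * FROM host WHERE hostnames @> '{foobar}';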

Erwin Brandstetter
  • See also "indexing arrays" at [this question/answer](http://stackoverflow.com/q/4058731/287948), and [this issue for pg9.3 "Array ELEMENT foreign keys"](https://commitfest.postgresql.org/action/patch_view?id=900) – Peter Krauss May 12 '14 at 09:59
  • @PeterKrauss: "Array ELEMENT Foreign Keys" are [stalled since 2012 due to serious problems with operator compatibility and performance](http://www.postgresql.org/message-id/flat/28389.1351094795@sss.pgh.pa.us#28389.1351094795@sss.pgh.pa.us). So that's not in pg 9.3 and not in pg 9.4 either. – Erwin Brandstetter Nov 19 '14 at 16:13

In case anyone still needs what was in the original question:

CREATE EXTENSION IF NOT EXISTS intarray;  -- provides the default gist operator class for integer[]

CREATE TABLE testtable(
    id serial PRIMARY KEY,
    refs integer[],
    EXCLUDE USING gist( refs WITH && )
);

INSERT INTO testtable( refs ) VALUES( ARRAY[100,200] );
INSERT INTO testtable( refs ) VALUES( ARRAY[200,300] );

and this would give you:

ERROR:  conflicting key value violates exclusion constraint "testtable_refs_excl"
DETAIL:  Key (refs)=({200,300}) conflicts with existing key (refs)=({100,200}).

Checked in Postgres 9.5 on Windows.

Note that this would create an index using the operator &&. So when you are working with testtable, checking ARRAY[x] && refs will be much faster than x = ANY( refs ).
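As a rough sketch with the sample data above (only the overlap form can use that gist index):

-- can use the gist index behind the exclusion constraint
SELECT * FROM testtable WHERE refs && ARRAY[200];

-- element-wise check; cannot use that index
SELECT * FROM testtable WHERE 200 = ANY (refs);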

P.S. Generally I agree with the above answer. In 99% of cases you'd prefer a normalized schema. Please try to avoid "hacky" stuff in production.

volvpavl
  • It turns out this doesn't work for string/text types: https://www.postgresql.org/message-id/CA+TgmobZhfRJNyz-fyw5kDtRurK0HjWP0vtP5fGZLE6eVSWCQw@mail.gmail.com Thank you for posting, though! I learned a ton about exclude constraints and indexes and operator classes digging into this. :) – vergenzt May 04 '17 at 19:45
  • Doesn't work for me on Linux Debian Stretch, PostgreSQL 10.3. Error: ERROR: data type integer[] has no default operator class for access method "gist". HINT: You must specify an operator class for the index or define a default operator class for the data type. – Enthusiasmus Apr 30 '18 at 15:02
  • 1
    @Enthusiasmus Probably you should check out [this](https://dba.stackexchange.com/questions/37351/) one. – volvpavl May 01 '18 at 17:16
  • @Enthusiasmus: this solution depends on the `gist__int_ops` operator class (which is an operator class on array of integer aka `integer[]` aka `_int4`) and can be provided by the `intarray` extension. The `btree_gist` extension linked by @volvpavl provides an operator class `gist_int_ops` which is an operator class on `integer` aka `int4` and doesn't help here. – tbussmann May 22 '20 at 17:30
  • @vergenzt: you can use this for `text[]` with help of a function that converts the text to an integer array with a hash function that produces integers. Some implementations are available in the [`hashlib`](https://github.com/markokr/pghashlib) extension (or by the intentionally undocumented internal function `hashtext()`): `CREATE FUNCTION hash_text_arr(text[]) RETURNS integer[] LANGUAGE sql IMMUTABLE AS 'SELECT array_agg(hashtext(v)) FROM unnest($1) v';` and an exclusion constraint `EXCLUDE USING gist( hash_text_arr(refs) WITH && )` but this becomes quite hacky and prone to hash collisions. – tbussmann May 22 '20 at 18:07
  • 1
    Please note that an exclusion constraint does only check against collisions with other rows not within a row. `INSERT INTO testtable( refs ) VALUES( ARRAY[400,400] );` would be possible. – tbussmann May 22 '20 at 18:09