2

Since hashes are smaller than lengthy text, it seems to me that they could be preferred to b-trees for ensuring column uniqueness.

For the sole purpose of ensuring uniqueness, is there any reason the following isn't OK in PG 10?

CREATE TABLE test (
  path ltree,
  EXCLUDE USING HASH ((path::text) WITH =)
);

I assume hash collisions are internally dealt with. Would be useless otherwise.

I will use a GiST index for query enhancement.

IamIC
  • 17,747
  • 20
  • 91
  • 154

1 Answers1

5

I think quoting the manual on this sums it up:

Although it's allowed, there is little point in using B-tree or hash indexes with an exclusion constraint, because this does nothing that an ordinary unique constraint doesn't do better. So in practice the access method will always be GiST or SP-GiST.

All the more since you want to create a GiST index anyway. The exclusion constraint with USING GIST will create a matching GiST index as implementation detail automatically. No point in maintaining another, inefficient hash index not even being used in queries.

For simple uniqueness (WITH =) a plain UNIQUE btree index is more efficient. If your keys are very long, consider a unique index on a hash expression (using any immutable function) to reduce size. Like:

CREATE UNIQUE INDEX test_path_hash_uni_idx ON test (my_hash_func(path));

Related:

md5() would be a simple option as hash function.

Before Postgres 10 the use of hash indexes was discouraged. But with the latest updates, this has improved. Robert Haas (core developer) sums it up in a blog entry:

CREATE INDEX test_path_hash_idx ON test USING HASH (path);

Alas (missed that in my draft), the access method "hash" does not support unique indexes, yet. So I would still go with the unique index on a hash expression above. (But no hash function that reduces information can fully guarantee uniqueness of the key - which may or may not be an issue.)

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    One note: hash indexes do not support `UNIQUE`, so I'll need to use a C hash function and non-unique b-tree index with a `BEFORE INSERT TRIGGER` that detects hash AND field equality. – IamIC Dec 26 '17 at 10:26
  • @IamIC: Thanks for pointing out. My last addition was too optimistic. But anyway, the unique (btree) expression index using your hash function would seem simpler / faster / more reliable. – Erwin Brandstetter Dec 26 '17 at 10:31
  • Hash function + trigger seems very worthwhile to me, especially when the source is of considerable length. Please note that the index cannot be unique in this case as there may be collisions, even with MD5. – IamIC Dec 26 '17 at 10:35
  • I agree that MD5 would be simpler in that it wouldn't require a custom C function. However, xxHash (for example) is considerably faster (which may or may not be meaningless) and is either 32 or 64-bit (which may be advantageous space and performance-wise over MD5). – IamIC Dec 26 '17 at 10:36
  • 1
    @IamIC: `md5()` is just a simple (good) example. It's not perfect, the cryptographic component adds some cost that may not be needed ... Your hash function may very well be superior. The cost is not *too* important, since it's not re-executed in select queries, the index is based on the *result*. – Erwin Brandstetter Dec 26 '17 at 10:41
  • I agree. Pragmatically, MD5 wins. – IamIC Dec 26 '17 at 10:44
  • 1
    With `md5()` (or other functions producing compatible results), consider casting to `uuid` like [demonstrated in the linked answer](https://stackoverflow.com/a/8335376/939860). – Erwin Brandstetter Dec 26 '17 at 10:49
  • I agree. Much more space efficient. I learned that from you a while ago :) – IamIC Dec 26 '17 at 10:50
  • Thanks much, Erwin! Merry Christmas to you, too! :) – IamIC Dec 26 '17 at 10:54
  • 2
    The `EXCLUDE USING HASH` solution seems to have some advantages: 1. UNIQUE btree index on a fast and small hash function will cause false errors on collisions. 2. Practically collision-free hash functions might (?) cause noticeable slowdown 3. `EXCLUDE` creates a simple hash index, which can also be used directly by queries. In contrast, the btree index on the hash function will only be used by queries which explicitly do something like `where md5(ltree1) = md5(ltree2) and ltree1 = ltree2`. Edit: My comment was about the case where the values are too large for a plain UNIQUE btree. – FunctorSalad Mar 05 '19 at 15:18
  • @FunctorSalad: The OP wrote: `I assume hash collisions are internally dealt with.` *Are they?* If so, It'd appear you have a point for values too large for btree indexes. But then why does hash not support `UNIQUE` indexes? – Erwin Brandstetter Mar 06 '19 at 01:04
  • 1
    @ErwinBrandstetter: Assuming hash indexes use the `hash$type` family of functions (not sure), [it does seem do deal with collisions internally](https://www.db-fiddle.com/f/hxQZjZmCSKdQiBQJuKoq2S/2). I would also expect this behavior because we were semantically asking Postgres for exclusion on the `=` operator, not on the hashes. – FunctorSalad Mar 06 '19 at 12:44
  • You really seem to have a point. Still confusing that (a) hash indexes do not support `UNIQUE`, and (b) the manual states that `... this does nothing that an ordinary unique constraint doesn't do better.` (You might want to start a separate question for this.) I posted the case to pgsql-general: https://www.postgresql.org/message-id/CAGHENJ6NT9OaAyOKykuZrss8k0rqM8%2BDsL18WBJuynAgXyk3Cw%40mail.gmail.com – Erwin Brandstetter Mar 06 '19 at 18:05