2

I tried to add the following table in PostgreSQL 14.3:

CREATE TABLE client_cache (
    id            BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    request       VARCHAR COMPRESSION lz4 NOT NULL CHECK (LENGTH (request) <= 10240),
    request_body  BYTEA COMPRESSION lz4 NOT NULL CHECK (LENGTH (request_body) <= 1048576),
    request_hash VARCHAR GENERATED ALWAYS AS (MD5(ROW(request::BYTEA, request_body)::VARCHAR)) STORED
);

But Postgres complains:

[42P17] ERROR: generation expression is not immutable

I've seen many answers discussing how to create a GENERATED column containing the MD5 of a single column, but as soon as you add ROW() to calculate MD5 over multiple columns, the expression is no longer immutable.

I can create a GENERATED column using ROW(MD5(A), MD5(B)) but not MD5(ROW(A, B)).

What can I do instead to create a single MD5 value over multiple columns of varying types (as seen above)?

I know that I can create a view or populate a column using a trigger, but I'd really like to stick to a GENERATED column if possible.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Gili
  • 86,244
  • 97
  • 390
  • 689
  • I think I know what's going on. There are multiple nested issues. But first, please describe your actual objective. Be precise, details matter. Why `request::BYTEA`? Seems pointless. Do you *need* a 16-byte hash, or is a `bigint` hash (8 bytes) good enough? – Erwin Brandstetter Jun 02 '22 at 00:44
  • @ErwinBrandstetter `request_body` represents an HTTP request body which may or may not be binary. That's why I opted for `BYTEA`. The real table contains the corresponding HTTP response and body but I omitted it for brevity. The cast is probably useless so you can drop it. As for the hash, I'd like to represent a real MD5 value, so I'd prefer to stick with 16 bytes. – Gili Jun 02 '22 at 00:56
  • @ErwinBrandstetter to clarify, the http response columns are not part of the hash or generated column so it shouldn't affect anything. – Gili Jun 02 '22 at 01:05
  • @ErwinBrandstetter I just found a solution that works but is not ideal: `GENERATED ALWAYS AS (MD5(request || MD5(request_body))) STORED`. It's not ideal because I'm computing two MD5s per column, but at least now the output structure is better defined. Do you know of a way to clean this up further? – Gili Jun 02 '22 at 01:20

2 Answers2

3

I suggest an immutable helper function:

CREATE OR REPLACE FUNCTION f_request_md5(_request text, _request_body bytea)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS 
'SELECT md5(textin(record_out((md5(_request_body), _request))))::uuid';

And a table like this:

CREATE TABLE client_cache (
  id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
, request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
, request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
, request_hash uuid   GENERATED ALWAYS AS (f_request_md5(request, request_body)) STORED
);

db<>fiddle here

Note the more efficient uuid instead of varchar. See:

Background

There are two overloaded variants of md5() in Postgres 14 (or any supported version):

test=> SELECT (proargtypes::regtype[])[0], prorettype::regtype, provolatile
test-> FROM   pg_proc
test-> WHERE  proname = 'md5';
 proargtypes | prorettype | provolatile 
-------------+------------+-------------
 bytea       | text       | i
 text        | text       | i
(2 rows)

One takes bytea, one text, both are IMMUTABLE and return text. So this expression is immutable:

ROW(MD5(request), MD5(request_body))

But this is not, like you found out the hard way:

MD5(ROW(A, B)::varchar)

The text representation of a record is not immutable. There are many reasons. One obvious reason for the case at hand: bytea output can be in (default) hex format or in the obsolescent escape format. A plain

SET bytea_output = 'escape'; 

... would break your generated column.

To get an immutable text representation of a bytea value, you'd run it through encode(request_body, 'hex'). But don't go there. md5(request_body) gives us a faster immutable text "representation" for our purpose.

We still can't cast a record. So I created the wrapper function. Be sure to read this related answer for more explanation:

Like discussed in that answer, the new, built-in function hash_record_extended() would be much more efficient for the purpose. So if a bigint is good enough, consider this:

CREATE TABLE client_cache2 (
  id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
, request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
, request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
, request_hash bigint GENERATED ALWAYS AS (hash_record_extended((request, request_body), 0)) STORED
);

same db<>fiddle here

Works out of the box in Postgres 14 or later.

Related:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Upvoted! Thank you for the excellent explanation of what is going on under the hood. The suggestion of using `uuid` is also a good one. How does this compare to the answer I posted below? To my naïve mind that solution is shorter, does not require a helper function, and is potentially faster since we compute MD5 only once. Although my columns are all `NOT NULL` it would be great to find a solution that represents `NULL` and empty string differently just in case someone else needs it. – Gili Jun 02 '22 at 01:33
  • @Gili: Like I explained: the cast to `varchar` is not immutable and can produce a different result in a different session. See: *db<>fiddle [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=685441267d1632c0e2d5643a57dc84cf)* Plus, concatenation is also unsafe, as two different rows might concatenate to the same. Plus also expensive. I expect my function to be faster - besides doing the right thing. – Erwin Brandstetter Jun 02 '22 at 01:43
  • Excellent, albeit surprising, answer. I agree with you that the cast from `BYTEA` to `VARCHAR`, isn't safe per your examples. The odd thing is that postgres happily accepts my code as IMMUTABLE. Is this a bug on their end? Also, what is the probability of a collision using `hash_record_extended`? I know that for 16-bits of MD5 it's approximately `1.47*10^-29`. – Gili Jun 02 '22 at 01:56
  • Here is an even more revealing demo with a `UNIQUE` index broken by this deceiving cast: *db<>fiddle [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=e54a72ccb8bc17a048b66398d0ec47d6)*. Is it a bug? Well, it's a grey area. Postgres might be more restrictive on this one, as it's very restrictive with immutability otherwise. Then again, Postgres also trusts user functions *faking* immutability ... and that's mighty useful at times. – Erwin Brandstetter Jun 02 '22 at 02:01
  • @Gili: *Also, what is the probability of a collision using hash_record_extended?* Like I said: *Be sure to read this related answer for more explanation:* https://dba.stackexchange.com/a/299107/3684 It addresses your question in-depth. – Erwin Brandstetter Jun 02 '22 at 02:14
  • Finally an answer that makes sense! – user528025 Aug 20 '22 at 07:01
0

I think I found a solution!

Postgres doesn't like:

request_hash VARCHAR GENERATED ALWAYS AS (MD5(ROW(request, request_body)::VARCHAR)) STORED but

request_hash VARCHAR GENERATED ALWAYS AS (MD5(request || request_body::VARCHAR)) STORED works just fine.

Hooray!

Gili
  • 86,244
  • 97
  • 390
  • 689
  • See discussion why I wouldn't advise to do that: https://stackoverflow.com/questions/72468749/how-to-create-a-generated-column-containing-the-md5-of-multiple-columns/72469657#comment128021122_72469657 – Erwin Brandstetter Jun 02 '22 at 02:21