How to create a GENERATED column containing the MD5 of multiple columns?

Question

I tried to add the following table in PostgreSQL 14.3:

CREATE TABLE client_cache (
    id            BIGINT PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    request       VARCHAR COMPRESSION lz4 NOT NULL CHECK (LENGTH (request) <= 10240),
    request_body  BYTEA COMPRESSION lz4 NOT NULL CHECK (LENGTH (request_body) <= 1048576),
    request_hash VARCHAR GENERATED ALWAYS AS (MD5(ROW(request::BYTEA, request_body)::VARCHAR)) STORED
);

But Postgres complains:

[42P17] ERROR: generation expression is not immutable

I've seen many answers discussing how to create a GENERATED column containing the MD5 of a single column, but as soon as you add ROW() to calculate MD5 over multiple columns, the expression is no longer immutable.

I can create a GENERATED column using ROW(MD5(A), MD5(B)) but not MD5(ROW(A, B)).

What can I do instead to create a single MD5 value over multiple columns of varying types (as seen above)?

I know that I can create a view or populate a column using a trigger, but I'd really like to stick to a GENERATED column if possible.

I think I know what's going on. There are multiple nested issues. But first, please describe your actual objective. Be precise, details matter. Why `request::BYTEA`? Seems pointless. Do you *need* a 16-byte hash, or is a `bigint` hash (8 bytes) good enough? — Erwin Brandstetter, Jun 02 '22 at 00:44
@ErwinBrandstetter `request_body` represents an HTTP request body which may or may not be binary. That's why I opted for `BYTEA`. The real table contains the corresponding HTTP response and body but I omitted it for brevity. The cast is probably useless so you can drop it. As for the hash, I'd like to represent a real MD5 value, so I'd prefer to stick with 16 bytes. — Gili, Jun 02 '22 at 00:56
@ErwinBrandstetter to clarify, the http response columns are not part of the hash or generated column so it shouldn't affect anything. — Gili, Jun 02 '22 at 01:05
@ErwinBrandstetter I just found a solution that works but is not ideal: `GENERATED ALWAYS AS (MD5(request || MD5(request_body))) STORED`. It's not ideal because I'm computing two MD5s per column, but at least now the output structure is better defined. Do you know of a way to clean this up further? — Gili, Jun 02 '22 at 01:20

Erwin Brandstetter · Accepted Answer · 2022-06-02T13:06:20.277

I suggest an immutable helper function:

CREATE OR REPLACE FUNCTION f_request_md5(_request text, _request_body bytea)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE PARALLEL SAFE AS 
'SELECT md5(textin(record_out((md5(_request_body), _request))))::uuid';

And a table like this:

CREATE TABLE client_cache (
  id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
, request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
, request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
, request_hash uuid   GENERATED ALWAYS AS (f_request_md5(request, request_body)) STORED
);

db<>fiddle here

Note the more efficient uuid instead of varchar. See:

What is the optimal data type for an MD5 field?

Background

There are two overloaded variants of md5() in Postgres 14 (or any supported version):

test=> SELECT (proargtypes::regtype[])[0], prorettype::regtype, provolatile
test-> FROM   pg_proc
test-> WHERE  proname = 'md5';
 proargtypes | prorettype | provolatile 
-------------+------------+-------------
 bytea       | text       | i
 text        | text       | i
(2 rows)

One takes bytea, one text, both are IMMUTABLE and return text. So this expression is immutable:

ROW(MD5(request), MD5(request_body))

But this is not, like you found out the hard way:

MD5(ROW(A, B)::varchar)

The text representation of a record is not immutable. There are many reasons. One obvious reason for the case at hand: bytea output can be in (default) hex format or in the obsolescent escape format. A plain

SET bytea_output = 'escape';

... would break your generated column.

To get an immutable text representation of a bytea value, you'd run it through encode(request_body, 'hex'). But don't go there. md5(request_body) gives us a faster immutable text "representation" for our purpose.

We still can't cast a record. So I created the wrapper function. Be sure to read this related answer for more explanation:

Why doesn't my UNIQUE constraint trigger?

Like discussed in that answer, the new, built-in function hash_record_extended() would be much more efficient for the purpose. So if a bigint is good enough, consider this:

CREATE TABLE client_cache2 (
  id           bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY
, request      text   COMPRESSION lz4 NOT NULL CHECK (length(request) <= 10240)
, request_body bytea  COMPRESSION lz4 NOT NULL CHECK (length(request_body) <= 1048576)
, request_hash bigint GENERATED ALWAYS AS (hash_record_extended((request, request_body), 0)) STORED
);

same db<>fiddle here

Works out of the box in Postgres 14 or later.

Upvoted! Thank you for the excellent explanation of what is going on under the hood. The suggestion of using `uuid` is also a good one. How does this compare to the answer I posted below? To my naïve mind that solution is shorter, does not require a helper function, and is potentially faster since we compute MD5 only once. Although my columns are all `NOT NULL` it would be great to find a solution that represents `NULL` and empty string differently just in case someone else needs it. — Gili, Jun 02 '22 at 01:33
@Gili: Like I explained: the cast to `varchar` is not immutable and can produce a different result in a different session. See: *db<>fiddle [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=685441267d1632c0e2d5643a57dc84cf)* Plus, concatenation is also unsafe, as two different rows might concatenate to the same. Plus also expensive. I expect my function to be faster - besides doing the right thing. — Erwin Brandstetter, Jun 02 '22 at 01:43
Excellent, albeit surprising, answer. I agree with you that the cast from `BYTEA` to `VARCHAR`, isn't safe per your examples. The odd thing is that postgres happily accepts my code as IMMUTABLE. Is this a bug on their end? Also, what is the probability of a collision using `hash_record_extended`? I know that for 16-bits of MD5 it's approximately `1.47*10^-29`. — Gili, Jun 02 '22 at 01:56
Here is an even more revealing demo with a `UNIQUE` index broken by this deceiving cast: *db<>fiddle [here](https://dbfiddle.uk/?rdbms=postgres_14&fiddle=e54a72ccb8bc17a048b66398d0ec47d6)*. Is it a bug? Well, it's a grey area. Postgres might be more restrictive on this one, as it's very restrictive with immutability otherwise. Then again, Postgres also trusts user functions *faking* immutability ... and that's mighty useful at times. — Erwin Brandstetter, Jun 02 '22 at 02:01
@Gili: *Also, what is the probability of a collision using hash_record_extended?* Like I said: *Be sure to read this related answer for more explanation:* https://dba.stackexchange.com/a/299107/3684 It addresses your question in-depth. — Erwin Brandstetter, Jun 02 '22 at 02:14

score 0 · Answer 2 · answered Jun 02 '22 at 01:25

0

I think I found a solution!

Postgres doesn't like:

request_hash VARCHAR GENERATED ALWAYS AS (MD5(ROW(request, request_body)::VARCHAR)) STORED but

request_hash VARCHAR GENERATED ALWAYS AS (MD5(request || request_body::VARCHAR)) STORED works just fine.

Hooray!

answered Jun 02 '22 at 01:25

Gili

86,244
97
390
689

See discussion why I wouldn't advise to do that: https://stackoverflow.com/questions/72468749/how-to-create-a-generated-column-containing-the-md5-of-multiple-columns/72469657#comment128021122_72469657 – Erwin Brandstetter Jun 02 '22 at 02:21

How to create a GENERATED column containing the MD5 of multiple columns?

2 Answers2

Background

Linked