"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint
PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE
. I suggest a custom function for the purpose:
-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
RETURNS uuid
LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
-- data table
CREATE TABLE data (
data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)
);
-- reference data_id in next table
CREATE TABLE day_data (
day text
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id) -- must be unique?
);
db<>fiddle here
With only text
columns, the function is actually IMMUTABLE
(which we need!). For other data types (like timestamptz
) it would not be.
In-depth explanation in this closely related answer:
You could use uniq_hash
as PK directly, but for many references, a bigint
is more efficient (8 vs. 16 bytes).
About generated columns:
Basic technique to avoid duplicates while inserting new data:
INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
If there can be concurrent writes, see: