
I have a choice in how a data table is created and am wondering which approach is more performant.

  1. Making a table with a row for every data point, or
  2. Making a table with an array column that will allow repeated content to be unnested.

That is, if I have the data:

day   val1   val2
Mon   7      11
Tue   7      11
Wed   8      9
Thu   1      4

Is it better to enter the data as shown above, or instead like this:

day         val1   val2
(Mon,Tue)   7      11
(Wed)       8      9
(Thu)       1      4

And then use unnest() to explode those into unique rows when I need them?
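
To make this concrete, here is a rough sketch of both layouts and the unnest() query I have in mind (table and column names are illustrative; the real table has ~20 value columns):

-- option 1: one row per data point
CREATE TABLE measurements (
  day  text
, val1 int
, val2 int
);

-- option 2: fold repeated rows and keep the days in an array column
CREATE TABLE measurements_arr (
  days text[]
, val1 int
, val2 int
);

-- exploding option 2 back into one row per day when needed
SELECT d.day, m.val1, m.val2
FROM   measurements_arr m
CROSS  JOIN LATERAL unnest(m.days) AS d(day);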

Assume that we're talking about large data in reality: 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table, but I'm concerned that unnest would be less performant than just having all of the rows.

autonopy
  • Does this answer your question? [Is storing a delimited list in a database column really that bad?](https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad) – philipxy Mar 15 '22 at 05:47
  • The unnesting during a query is most likely more expensive than storing this in a properly normalized form. E.g. It's nearly impossible to index the result of the unnest if you need to apply any conditions on that result. – Mar 15 '22 at 06:05
  • Is `day` a contiguous range of values? Then maybe using a range (or multirange) column might be another option. – Mar 15 '22 at 06:07
  • 100k rows a day is imho not a lot. That's just 36.5 million a year. Too much to do it manually, but that's why you use a database. We process about 40 million records a day and it works fine. – Frank Heikens Mar 15 '22 at 09:10

2 Answers


I believe a table with a row for every data point would be the option I would go for, as unnest on large amounts of data would be just as slow, if not slower. Plus, unless your data is very repetitive, 20 columns is a lot to align.
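
For illustration (a sketch using the example names from the question, not part of the original answer): with one row per data point, a condition on day can use a plain B-tree index, while the array layout needs a GIN index and containment operators to avoid unnesting first.

-- per-row layout: straightforward B-tree index on the filter column
CREATE INDEX measurements_day_idx ON measurements (day);
SELECT val1, val2 FROM measurements WHERE day = 'Mon';

-- array layout: a GIN index helps containment checks like this,
-- but not conditions applied to the result of unnest()
CREATE INDEX measurements_arr_days_idx ON measurements_arr USING gin (days);
SELECT val1, val2 FROM measurements_arr WHERE days @> '{Mon}';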


"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.

Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:

-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
  RETURNS uuid
  LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS 
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';

-- data table
CREATE TABLE data (
  data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)   
);

-- reference data_id in next table
CREATE TABLE day_data (
  day text 
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id)      -- must be unique?
);

db<>fiddle here
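
To get the exploded per-day rows back (the equivalent of your unnest()), join the link table to the data table. A minimal sketch:

SELECT dd.day, d.col1, d.col2  -- , ... , col20
FROM   day_data dd
JOIN   data d USING (data_id);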

With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
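
For illustration: the text representation that record_out() produces for a timestamptz depends on the session's TimeZone setting, so a hash built from it would not be stable:

SET timezone = 'UTC';
SELECT '2022-03-15 05:47+00'::timestamptz::text;   -- 2022-03-15 05:47:00+00
SET timezone = 'America/New_York';
SELECT '2022-03-15 05:47+00'::timestamptz::text;   -- 2022-03-15 01:47:00-04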

In-depth explanation in this closely related answer:

You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).

About generated columns:

Basic technique to avoid duplicates while inserting new data:

INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
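
One way to link the day in the same statement is a data-modifying CTE. A sketch, which only covers the case where the data row is actually new (an existing row returns nothing from ON CONFLICT DO NOTHING; for that case you would look the data_id up via uniq_hash):

WITH ins AS (
   INSERT INTO data (col1, col2) VALUES ('foo', 'bam')
   ON CONFLICT DO NOTHING
   RETURNING data_id
)
INSERT INTO day_data (day, data_id)
SELECT 'Mon', data_id
FROM   ins;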

If there can be concurrent writes, see:

Erwin Brandstetter
  • Thanks for your detailed response. What I'm unsure about is how it answers my question. It may be because I failed to include the unique key that's in my data (SO gave me a lot of trouble posting this question), but my question is focused on performance. I am in control of table data population as it inserts from a Python script, so I can manage dups vs. exploded inserts. Nonetheless, I think you solved my problem for another project. – autonopy Mar 15 '22 at 12:45
  • @autonopy: Avoiding duplicative storage to keep table sizes reasonable will be the main angle for performance with 100k new rows per day. If you don't *need* the unique hash to handle duplicates (efficiently), all the better. – Erwin Brandstetter Mar 16 '22 at 04:21