
I have a table with a PK id, two FKs (two other ids) and a timestamp. I would like to put a limit on that table so there can only be a limited number of rows for each combination of FKs. For example, if there are more than 10 rows with the same FKs, the oldest one by the timestamp should be removed.

My current idea is a BEFORE INSERT trigger which checks whether there are more than some number of rows in the table with the same FKs as the row that is about to be inserted. If there are, the oldest one by the timestamp field should be deleted.

Can someone help me with implementing this?
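
For illustration, a minimal sketch of the kind of BEFORE INSERT trigger described above might look like this. The table and column names (my_table, fk_1, fk_2, created_at) and the limit of 10 are assumptions, and, as the comments below point out, this approach is not safe under concurrent writers without extra locking:

CREATE TABLE my_table (
    id         serial primary key,
    fk_1       integer not null,
    fk_2       integer not null,
    created_at timestamptz not null default now()
);

CREATE FUNCTION trim_my_table() RETURNS trigger AS $$
BEGIN
    -- Keep only the 9 newest existing rows for this (fk_1, fk_2),
    -- so that together with the new row there are at most 10.
    DELETE FROM my_table
    WHERE id IN (
        SELECT id
        FROM my_table
        WHERE fk_1 = NEW.fk_1
          AND fk_2 = NEW.fk_2
        ORDER BY created_at DESC
        OFFSET 9
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trim_my_table_trg
    BEFORE INSERT ON my_table
    FOR EACH ROW EXECUTE PROCEDURE trim_my_table();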

  • What if an application error causes it to add lots of bad rows very quickly? Your old history will be gone in an instant. Also, deleting data in an update trigger will make updates slower. It's better to implement purging of old data as a batch process with some auditing, perhaps even backing up old rows somewhere. – mvp Mar 10 '19 at 09:40
  • There is a theoretical SQL construct for enforcing constraints across rows, the ASSERTION. Unfortunately PostgreSQL hasn't implemented it (nor has any other enterprise RDBMS product). The big sticking point is the one which will trip you up: concurrency. Say, your table has ten rows and two users simultaneously try to insert a row. Now both sessions will attempt to delete the oldest row: uh oh! So you need to serialize DML access to the table. – APC Mar 10 '19 at 10:27
  • I should have explained that I am sure only one user will be writing at a time, since the backend application is handling that – Tonci Zilic Mar 10 '19 at 10:43
  • Why do you need to limit the rows logically stored? I would rather go with a row-limiting view (e.g. https://stackoverflow.com/questions/1124603/grouped-limit-in-postgresql-show-the-first-n-rows-for-each-group; a sketch follows these comments) and then do a periodic cleanup if required for performance reasons. Any solution that works directly in the writing transactions will have a serious performance penalty under high load. If your data is time-series, you can also consider integrating with an extension such as [TimescaleDB](https://docs.timescale.com), which can give you a serious boost in write and read performance. – Ancoron Mar 10 '19 at 11:22
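
For reference, a sketch of the row-limiting view suggested in the last comment, reusing the hypothetical my_table from the sketch above: the base table keeps everything, the view exposes only the 10 newest rows per (fk_1, fk_2), and an occasional batch DELETE prunes the rest.

CREATE VIEW my_table_limited AS
SELECT id, fk_1, fk_2, created_at
FROM (
    SELECT t.*,
           row_number() OVER (PARTITION BY fk_1, fk_2
                              ORDER BY created_at DESC) AS rn
    FROM my_table t
) ranked
WHERE rn <= 10;

-- Periodic cleanup (e.g. from a cron job): drop rows that fell out of the window.
DELETE FROM my_table
WHERE id NOT IN (SELECT id FROM my_table_limited);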

2 Answers


An alternative solution to this problem is to use an array column instead of individual rows, together with INSERT INTO ... ON CONFLICT DO UPDATE (upsert) support.

Online example: https://www.db-fiddle.com/f/2y46V6EEVJLQ5cPNTDAUPy/0

Structure:

CREATE TABLE test_rr (
    id serial primary key,
    fk_1 integer not null,
    fk_2 integer not null,
    latest timestamptz[] not null
);

CREATE UNIQUE INDEX idx_unique_rr ON test_rr (fk_1, fk_2);

Upserting data:

INSERT INTO test_rr (fk_1, fk_2, latest)
    VALUES (1, 2, array[current_timestamp])
    ON CONFLICT (fk_1, fk_2) DO UPDATE SET latest = (array_cat(EXCLUDED.latest, test_rr.latest))[:10];

Selecting entries:

SELECT id, fk_1, fk_2, unnest(latest) AS ts FROM test_rr WHERE fk_1 = 1 AND fk_2 = 2;

...resulting in:

 id  | fk_1 | fk_2 |             ts
-----+------+------+-------------------------------
 652 |    1 |    2 | 2019-03-10 13:28:57.806489+01
 652 |    1 |    2 | 2019-03-10 13:28:56.670678+01
 652 |    1 |    2 | 2019-03-10 13:28:55.470668+01
 652 |    1 |    2 | 2019-03-10 13:28:54.174111+01
 652 |    1 |    2 | 2019-03-10 13:28:52.878719+01
 652 |    1 |    2 | 2019-03-10 13:28:51.3748+01
 652 |    1 |    2 | 2019-03-10 13:28:49.886457+01
 652 |    1 |    2 | 2019-03-10 13:28:48.190317+01
 652 |    1 |    2 | 2019-03-10 13:28:46.350833+01
 652 |    1 |    2 | 2019-03-10 13:11:50.506323+01
(10 rows)

Instead of timestamptz[], you can also create your own type to support more columns:

CREATE TYPE my_entry_data AS (ts timestamptz, data varchar);

CREATE TABLE test_rr (
    id serial primary key,
    fk_1 integer not null,
    fk_2 integer not null,
    latest my_entry_data[] not null
);

CREATE UNIQUE INDEX idx_unique_rr ON test_rr (fk_1, fk_2);

-- ...
INSERT INTO test_rr (fk_1, fk_2, latest)
  VALUES (1, 2, array[(current_timestamp,'L')::my_entry_data])
  ON CONFLICT (fk_1, fk_2) DO UPDATE
    SET latest = (array_cat(EXCLUDED.latest, test_rr.latest))[:10];

SELECT id, fk_1, fk_2, tmp.ts, tmp.data
FROM test_rr, unnest(latest) AS tmp -- LATERAL function call
WHERE fk_1 = 1 AND fk_2 = 2;

However, load tests would have to show whether this is actually faster than triggers or other approaches. At least it has the benefit that rows are only updated, not inserted and then deleted, which might save some I/O.

Ancoron

You should use a separate summary table, maintained with a trigger, which contains the columns fk1, fk2 and count, with a CHECK (count <= N) constraint and an index on (fk1, fk2).

Before you insert a row into your original table, you should check the value of count and, if it is at the limit, delete the oldest row first. This can be done with a trigger if you don't want to do it in the application.
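
A minimal sketch of this approach, again assuming a hypothetical original table my_table(id, fk_1, fk_2, created_at) and N = 10; only inserts are handled here, so deletes or FK updates on the original table would need their own counter maintenance:

CREATE TABLE my_table_counts (
    fk_1 integer not null,
    fk_2 integer not null,
    cnt  integer not null check (cnt <= 10),
    primary key (fk_1, fk_2)
);

CREATE FUNCTION keep_limit() RETURNS trigger AS $$
DECLARE
    current_cnt integer;
BEGIN
    -- Make sure a counter row exists for this (fk_1, fk_2).
    INSERT INTO my_table_counts (fk_1, fk_2, cnt)
        VALUES (NEW.fk_1, NEW.fk_2, 0)
        ON CONFLICT (fk_1, fk_2) DO NOTHING;

    -- Lock the counter row; this serializes inserts per (fk_1, fk_2).
    SELECT cnt INTO current_cnt
    FROM my_table_counts
    WHERE fk_1 = NEW.fk_1 AND fk_2 = NEW.fk_2
    FOR UPDATE;

    IF current_cnt >= 10 THEN
        -- Already at the limit: remove the oldest row; the count stays the same.
        DELETE FROM my_table
        WHERE id = (SELECT id
                    FROM my_table
                    WHERE fk_1 = NEW.fk_1 AND fk_2 = NEW.fk_2
                    ORDER BY created_at
                    LIMIT 1);
    ELSE
        UPDATE my_table_counts
        SET cnt = cnt + 1
        WHERE fk_1 = NEW.fk_1 AND fk_2 = NEW.fk_2;
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER keep_limit_trg
    BEFORE INSERT ON my_table
    FOR EACH ROW EXECUTE PROCEDURE keep_limit();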

You have to remember that:

  • if you change fk1 or fk2 values or delete multiple rows with the same fk1 and fk2 in the same transaction, you have to make sure you do this in some specified order (for example ordered by id) or you can get deadlocks;
  • you can't add more than N rows with the same (fk1, fk2) in a single transaction - there would not be enough rows to delete first;
  • there will be a performance penalty (worse parallelism) for adding multiple rows with the same (fk1, fk2).

A simple trigger which only checks the number of rows and deletes the oldest before insert can be slow if there are many (like 100+) rows with the same (fk1, fk2). It can also allow too many rows when multiple inserts are done in parallel.

Tometzky