I have a web application backed by a Postgres (v11) database and a main table where each row in the table may be seen as an object and each column is a field of the object.
So we have:
| id | name | field1 | field2 | .... | field100 |
--------------------------------------------------
| 1  | foo  | 12.2   | blue   | .... | 13.7     |
| 2  | bar  | 22.1   | green  | .... | 78.0     |
The table was created using:
CREATE TABLE records(
    id VARCHAR(50) PRIMARY KEY,
    name VARCHAR(50),
    field1 NUMERIC,
    field2 VARCHAR(355),
    -- ... field3 through field99 omitted for brevity ...
    field100 NUMERIC);
Now I have an audit table which stores updates to every field of every object. The audit table is defined as:
| timestamp | objid | fieldname | oldval | newval |
---------------------------------------------------
| 1234      | 1     | field2    | white  | blue   |
| 1367      | 1     | field1    | "11.5" | "12.2" |
| 1372      | 2     | field1    | "11.9" | "22.1" |
| 1387      | 1     | name      | baz    | foo    |
The table was created using:
CREATE TABLE audit_log(
    timestamp TIMESTAMP,
    objid VARCHAR(50) REFERENCES records(id),
    fieldname VARCHAR(50) NOT NULL,
    oldval VARCHAR(355),
    newval VARCHAR(355));
oldval/newval are kept as VARCHAR since they are purely for auditing purposes, so the actual datatype does not really matter.
For obvious reasons, this table has gotten huge over the last few years, so I wanted to delete some old data. Someone suggested keeping only the last 5 updates for every object (i.e. the UI can then show the last 5 updates from the audit table).
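For a single object, the UI part is straightforward; something like this (the object id '1' is just an example) is what I have in mind:

SELECT timestamp, fieldname, oldval, newval
FROM audit_log
WHERE objid = '1'   -- example object id
ORDER BY timestamp DESC
LIMIT 5;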
I understand you can get this using a GROUP BY and a LIMIT, but the problem is that I have a million+ objects, some of which have been updated a thousand+ times while others have barely a few updates over multiple years. And the audit log is very read/write-heavy (as can be expected).
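Concretely, the per-object cutoff I picture looks roughly like this (a sketch using a LATERAL subquery rather than a plain GROUP BY; it assumes an index on audit_log(objid, timestamp) to be usable at this scale):

-- 5th-latest update timestamp per object; objects with fewer than
-- 5 updates simply produce no row here (nothing to delete for them)
SELECT r.id AS objid, cutoff.ts AS fifth_latest
FROM records r
CROSS JOIN LATERAL (
    SELECT a.timestamp AS ts
    FROM audit_log a
    WHERE a.objid = r.id
    ORDER BY a.timestamp DESC
    LIMIT 1 OFFSET 4
) AS cutoff;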
What would be the best way to delete all entries which are older than the 5th-latest update for each object (of course, ideally I'd move that data out to some secondary storage)?
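In case it helps frame the question, this is roughly the single-pass statement I've been considering, though I doubt it behaves well on a table this size. audit_log_archive is a made-up table with the same columns as audit_log, and I join on the system column ctid only because audit_log has no primary key:

WITH ranked AS (
    SELECT ctid AS row_id,
           row_number() OVER (PARTITION BY objid
                              ORDER BY timestamp DESC) AS rn
    FROM audit_log
),
moved AS (
    -- delete everything beyond the 5 most recent rows per object
    DELETE FROM audit_log a
    USING ranked r
    WHERE a.ctid = r.row_id
      AND r.rn > 5
    RETURNING a.*
)
INSERT INTO audit_log_archive
SELECT * FROM moved;

Given how read/write-heavy the table is, I assume I'd have to run something like this in batches (e.g. per range of objid) rather than in one go, but that is exactly the part I'm unsure about.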