
I have a PostgreSQL database with a table that has multiple entries for the same objectid, on multiple devicenames, but with a unique timestamp for each entry. The table looks something like this:

address | devicename | objectid      |  timestamp       
--------+------------+---------------+------------------------------
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-03 15:37:09.06065+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-03 15:48:33.93128+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-05 16:01:59.266779+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-05 16:13:46.843113+00
1.1.1.1 | device1    | vs_hub.ch1_25 | 2012-10-06 01:11:45.853361+00
1.1.1.2 | device2    | vs_hub.ch1_25 | 2012-10-06 01:23:21.204324+00

I want to delete all but the oldest entry for each objectid and devicename. In this case I want to delete all but:

1.1.1.1 | device1 | vs_hub.ch1_25 | 2012-10-02 17:36:41.011629+00
1.1.1.2 | device2 | vs_hub.ch1_25 | 2012-10-02 17:48:01.755559+00

Is there a way to do this? Or is it possible to select the oldest entries for both "objectid and devicename" into a temp table?

  • Does the table have a primary key or unique identifier column? – jcern Oct 10 '12 at 15:07
  • No, the table does not have a primary key or identifier column... Thank you for any suggestions you have. – dars33 Oct 10 '12 at 15:15
  • Please identify what column or group of columns identifies a row unequivocally. That would make it all easier. If there's no such group of columns, then you should add a sequence-based ID column. – Tulains Córdova Oct 10 '12 at 17:42

5 Answers

7

This should do it:

delete from devices
using (
   select ctid as cid, 
          row_number() over (partition by devicename, objectid order by timestamp asc) as rn
   from devices
) newest
where newest.cid = devices.ctid
and newest.rn <> 1;

It creates a derived table that assigns a row number to each row within every combination of (devicename, objectid), giving the earliest one (the one with the smallest timestamp value) the number 1. This result is then used to delete all rows that do not have the number 1. The system column ctid is used to uniquely identify those rows (it's an internal row identifier supplied by Postgres, useful here because the table has no primary key).
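
If you want to verify which rows would go before running the DELETE, the same derived table can be reused in a plain SELECT (a sketch, assuming the table really is called devices as above):

select d.*
from   devices d
join (
   select ctid as cid,
          row_number() over (partition by devicename, objectid order by timestamp asc) as rn
   from   devices
) newest on newest.cid = d.ctid
where  newest.rn <> 1;   -- the rows the DELETE would remove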

Note that for deleting a really large number of rows, Erwin's approach will most definitely be faster.

SQLFiddle demo: http://www.sqlfiddle.com/#!1/5d9fe/2

  • +1 for using [`ctid`](http://www.postgresql.org/docs/current/interactive/ddl-system-columns.html) as temporary fallback for the missing primary key – Erwin Brandstetter Oct 10 '12 at 19:36
4

To distill the described result, this would probably be the simplest and fastest way:

SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts;  -- ascending, so the oldest row per (devicename, objectid) is kept

Details and explanation in this related answer.

From your sample data, I conclude that you are going to delete large portions of the original table. In that case it is probably faster to just TRUNCATE the table (or DROP & recreate it, since you should add a surrogate pk column anyway) and write the remaining rows back to it. This also gives you a pristine table, implicitly clustered (ordered) the way that's best for your queries, saves the work that VACUUM would otherwise have to do, and is probably still faster overall.

I would also strongly advise adding a surrogate primary key to your table, preferably a serial column:

BEGIN;

CREATE TEMP TABLE tmp_tbl ON COMMIT DROP AS
SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts;   -- ascending: keep the oldest row per (devicename, objectid)

TRUNCATE tbl;
ALTER TABLE tbl ADD COLUMN tbl_id serial PRIMARY KEY;

-- or, if you can afford to drop & recreate:
-- DROP TABLE tbl;
-- CREATE TABLE tbl (
--   tbl_id serial PRIMARY KEY
-- , address text
-- , devicename text
-- , objectid text
-- , ts timestamp);

INSERT INTO tbl (address, devicename, objectid, ts)
SELECT address, devicename, objectid, ts
FROM   tmp_tbl;

COMMIT;

Do it all within one transaction to make sure you cannot fail halfway through.

This is fast as long as your setting for temp_buffers is big enough to hold the temporary table. Otherwise the system starts swapping data to disk and performance takes a dive. You can set temp_buffers just for the current session like this:

SET temp_buffers = '1000MB';

This way you don't waste RAM that you don't normally need for temp_buffers. It has to be set before the first use of any temporary objects in the session. More information in this related answer.
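
To check what the current session is using, a plain SHOW works (nothing assumed beyond standard PostgreSQL; the default is 8MB):

SHOW temp_buffers;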

Also, as the INSERT follows a TRUNCATE inside the same transaction, it will be easy on the Write-Ahead Log, which improves performance.

Consider CREATE TABLE AS for the alternative route.
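
A minimal sketch of that route, assuming the same table and column names as above (the intermediate name tbl_new is made up for illustration):

BEGIN;

CREATE TABLE tbl_new AS
SELECT DISTINCT ON (devicename, objectid) *
FROM   tbl
ORDER  BY devicename, objectid, ts;   -- ascending: keeps the oldest row per group

ALTER TABLE tbl_new ADD COLUMN tbl_id serial PRIMARY KEY;

DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;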

The only downside: You need an exclusive lock on the table. This may be a problem in databases with heavy concurrent load.

Finally, never use timestamp as a column name. It's a reserved word in every SQL standard and a type name in PostgreSQL. I used ts instead.
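
If renaming the existing column is an option, that's a one-liner (a sketch; adjust the table name to yours):

ALTER TABLE tbl RENAME COLUMN "timestamp" TO ts;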

Erwin Brandstetter
  • Thank you for taking the time to reply. Your answer was very helpful and did solve my issue. Thx Erwin! – dars33 Oct 10 '12 at 18:58
0

DELETE FROM devices d
WHERE d.timestamp = (SELECT min(timestamp)
                     FROM devices
                     WHERE objectid = d.objectid
                       AND devicename = d.devicename);

  • 1
    Be careful when using that query as it will delete additional records for other devices if the same timestamp exists in the table for more than 1 device. – jcern Oct 10 '12 at 15:12
  • 1
    Backwards. Here you are deleting the oldest. He wants to keep the oldest and delete everything else. Also you will be deleting anything that shares the exact same timestamp (may not be likely, but could be possible I would assume) – digitaljoel Oct 10 '12 at 15:12
0

This should work, assuming that address, devicename and objectid together make up a unique identifier:

DELETE FROM tablename 
WHERE 
  address || devicename || objectid || timestamp NOT IN 
  (SELECT 
     address || devicename || objectid || min(timestamp) 
   FROM tablename 
   GROUP BY address, devicename, objectid)

This uses a concatenated string built from the unique columns to tie the two queries together. The inner SELECT finds the minimum timestamp for each unique combination; the outer DELETE then removes every record that is not one of those oldest rows. Probably not the most efficient, but it should work.

jcern
  • To me this looks ugly and terribly inefficient. BTW: The *middle* `IN ()` loop is only there to guard against NULL fields? – wildplasser Oct 10 '12 at 18:25
  • I mentioned that it might not be the most efficient. I think you are correct, the middle IN was not necessary and has been removed from the post, it was a remnant of testing. – jcern Oct 10 '12 at 18:41
  • 1
    The middle IN would be necessary if any of the concatenated columns were nullable. (Remember: IN is mostly a bad habit. EXISTS is cleaner in almost every case) – wildplasser Oct 10 '12 at 18:43
  • To safeguard against `NULL` values in string concatenation use [`concat()` or `concat_ws()`](http://www.postgresql.org/docs/current/interactive/functions-string.html#FUNCTIONS-STRING-OTHER). I wouldn't use this particular query. Better alternatives have been posted. – Erwin Brandstetter Oct 10 '12 at 19:41
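
For illustration, a NULL-safe variant of the concatenation approach above, using concat_ws() as the last comment suggests (a sketch; the '|' separator is arbitrary, and the better alternatives posted above still apply):

DELETE FROM tablename
WHERE concat_ws('|', address, devicename, objectid, timestamp) NOT IN
  (SELECT concat_ws('|', address, devicename, objectid, min(timestamp))
   FROM tablename
   GROUP BY address, devicename, objectid);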
0

My suggestion is to use a subquery that checks for the existence of a record with an older timestamp:

DELETE FROM tablename
WHERE EXISTS(
  SELECT * FROM tablename a
  WHERE tablename.address = a.address
    AND tablename.devicename = a.devicename
    AND tablename.objectid = a.objectid
    AND a.timestamp < tablename.timestamp
)

The query for selecting the oldest records would look like this:

SELECT address, devicename, objectid, MIN(timestamp)
FROM tablename
GROUP BY address, devicename, objectid

  • Alias "a" should be on the outermost query. – Tulains Córdova Oct 10 '12 at 16:40
  • The question asks for distinct rows on `(objectid, devicename)`. You got that wrong and distill rows for `(address, objectid, devicename)`. As a consequence, the `SELECT` query does not work **at all** - `GROUP BY` cannot achieve the required result in one query level. You'd need a subquery, window functions or `DISTINCT ON` like demonstrated in my answer. The `DELETE` query would still work, though. – Erwin Brandstetter Oct 10 '12 at 16:44