
I have a table without an id or primary key field:

CREATE TABLE IF NOT EXISTS test
(
    time       TIMESTAMPTZ NOT NULL,
    bid_volume INTEGER,
    bid_price  DOUBLE PRECISION,
    ask_price  DOUBLE PRECISION,
    ask_volume INTEGER
);

I have a number of rows which have duplicate timestamps:

# SELECT * FROM test ORDER BY time;
             time              | bid_volume | bid_price | ask_price | ask_volume 
-------------------------------+------------+-----------+-----------+------------
 2021-02-01 08:00:00.123457+00 |         99 |   1256.25 |    1256.5 |         75
 2021-02-01 08:00:00.123457+00 |        100 |   1256.25 |    1256.5 |         75
 2021-02-01 08:00:00.123457+00 |        100 |   1256.25 |    1256.5 |         76
 2021-02-01 08:00:00.123457+00 |        100 |    1256.5 |    1256.5 |         76
 2021-02-01 08:00:01.1227+00   |        100 |    1256.5 |    1256.5 |         76
 2021-02-01 08:01:01.126343+00 |         80 |    1257.5 |    1256.5 |         76
(6 rows)

I would like to delete all but one of the duplicate timestamps.

My investigations thus far lead me to believe I can use ROW_NUMBER() with a PARTITION, and delete all rows with rn > 1

test=# SELECT ROW_NUMBER() OVER (PARTITION BY time) AS rn, * FROM test;
 rn |             time              | bid_volume | bid_price | ask_price | ask_volume 
----+-------------------------------+------------+-----------+-----------+------------
  1 | 2021-02-01 08:00:00.123457+00 |         99 |   1256.25 |    1256.5 |         75
  2 | 2021-02-01 08:00:00.123457+00 |        100 |   1256.25 |    1256.5 |         75
  3 | 2021-02-01 08:00:00.123457+00 |        100 |   1256.25 |    1256.5 |         76
  4 | 2021-02-01 08:00:00.123457+00 |        100 |    1256.5 |    1256.5 |         76
  1 | 2021-02-01 08:00:01.1227+00   |        100 |    1256.5 |    1256.5 |         76
  1 | 2021-02-01 08:01:01.126343+00 |         80 |    1257.5 |    1256.5 |         76
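
The numbering behaviour can be reproduced without a Postgres instance. Below is a minimal SQLite analogue of the `ROW_NUMBER() OVER (PARTITION BY time)` query (an illustration, not the original: window functions need SQLite >= 3.25, and `time` is stored as TEXT):

```python
import sqlite3

# Minimal SQLite analogue of the ROW_NUMBER() query above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    time TEXT, bid_volume INTEGER, bid_price REAL,
    ask_price REAL, ask_volume INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?,?,?,?,?)", [
    ("2021-02-01 08:00:00.123457+00",  99, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 76),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:00:01.1227+00",   100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:01:01.126343+00",  80, 1257.5,  1256.5, 76),
])

# rn restarts at 1 for each distinct timestamp.
numbered = conn.execute("""
    SELECT ROW_NUMBER() OVER (PARTITION BY time) AS rn, time
    FROM test ORDER BY time, rn""").fetchall()
print([rn for rn, _ in numbered])  # [1, 2, 3, 4, 1, 1]
```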

There are several questions on SO which seem to be related; however, all of them use an additional id field.

Without an additional id field I am unable to keep one of the rows:

# DELETE FROM test del
USING 
( 
    SELECT 
        time, 
        row_number() OVER (PARTITION BY time) AS rn
    FROM test
) sub
WHERE 
    sub.time = del.time
AND 
    sub.rn > 1;

This deletes all duplicate rows (i.e. the sub.rn > 1 doesn't seem to have any effect):

# SELECT * FROM test ORDER BY time;
             time              | bid_volume | bid_price | ask_price | ask_volume 
-------------------------------+------------+-----------+-----------+------------
 2021-02-01 08:00:01.1227+00   |        100 |    1256.5 |    1256.5 |         76
 2021-02-01 08:01:01.126343+00 |         80 |    1257.5 |    1256.5 |         76

Notice timestamp 2021-02-01 08:00:00.123457+00 is gone!
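
The over-deletion can be reproduced in a self-contained SQLite sketch (an analogue: SQLite's DELETE has no USING clause, so the join is expressed as IN). Matching back on time alone targets every row whose timestamp has any rn > 1 somewhere, including the rn = 1 row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    time TEXT, bid_volume INTEGER, bid_price REAL,
    ask_price REAL, ask_volume INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?,?,?,?,?)", [
    ("2021-02-01 08:00:00.123457+00",  99, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 76),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:00:01.1227+00",   100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:01:01.126343+00",  80, 1257.5,  1256.5, 76),
])

# Delete every row whose timestamp appears with rn > 1 -- this
# marks ALL rows sharing a duplicated timestamp, not just the extras.
conn.execute("""
    DELETE FROM test WHERE time IN (
        SELECT time FROM (
            SELECT time, ROW_NUMBER() OVER (PARTITION BY time) AS rn
            FROM test)
        WHERE rn > 1)""")

remaining = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(remaining)  # 2 -- all four rows for the duplicated timestamp are gone
```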

Dropping the sub.time = del.time clause causes all rows to be deleted! (again, sub.rn > 1 doesn't seem to have any effect):

# DELETE FROM test del
USING 
( 
    SELECT 
        time, 
        row_number() OVER (PARTITION BY time) AS rn
    FROM test
) sub
WHERE 
    sub.rn > 1;

DELETE 6 <-- Oh no!

How can I delete the duplicates from my table, keeping one?

Steve Lorimer
  • Next time: make sure your table has a Primary Key. [theoretically, a table without a PK has no meaning] – wildplasser Jun 17 '21 at 17:39
  • basically the timestamp is the id field. – Jasen Jun 18 '21 at 06:01
  • @wildplasser as Jasen said, the timestamp is the id field, but sometimes there are duplicates which I need to remove. In fact, after cleaning, I update the table to set the timestamp as the primary key. – Steve Lorimer Jun 18 '21 at 07:28
  • @Jasen you are correct! I'm doing a `psql -c "COPY ... FROM STDIN` - if only I could add `ON CONFLICT DO UPDATE` then this wouldn't be necessary – Steve Lorimer Jun 18 '21 at 07:35
  • So if two events happen at exactly the same time, you choose to ignore one of them? I hope you don't work for my bank. – wildplasser Jun 18 '21 at 07:43
  • @wildplasser you don't know anything about the problem domain that I'm working in here - so yes, for this particular problem, ignoring all but the last update is perfectly fine – Steve Lorimer Jun 18 '21 at 07:49
  • in your example, you have four observations with timestamp=`2021-02-01 08:00:00.123457+00` with three different sets of values. Which one(s) are you going to keep and why? And which one is *the last*? – wildplasser Jun 18 '21 at 07:51
  • @wildplasser honestly, for my use case, it doesn't *really* matter. If I was being pedantic then I'd like to define the *last* one as the last one inserted (max `ctid` I assume? They are read from a Refinitiv tick history csv file using `zcat file | psql -c "COPY ... FROM STDIN"`), but equally, it doesn't really matter – Steve Lorimer Jun 18 '21 at 08:07

2 Answers


You can use the hidden system column `ctid`:

delete from test
where exists (
    select 1 from test as test2
    where test2.time = test.time
      and test2.ctid > test.ctid
);
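
For anyone without a Postgres instance handy, here is a hedged SQLite sketch of the same idea, with SQLite's `rowid` standing in for `ctid` (an analogue, not the Postgres-specific original; a row is deleted whenever a later physical row shares its timestamp, so the last one survives):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    time TEXT, bid_volume INTEGER, bid_price REAL,
    ask_price REAL, ask_volume INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?,?,?,?,?)", [
    ("2021-02-01 08:00:00.123457+00",  99, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 76),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:00:01.1227+00",   100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:01:01.126343+00",  80, 1257.5,  1256.5, 76),
])

# Delete any row for which a later row (bigger rowid) has the same time.
conn.execute("""
    DELETE FROM test WHERE EXISTS (
        SELECT 1 FROM test AS test2
        WHERE test2.time = test.time AND test2.rowid > test.rowid)""")

rows = conn.execute("SELECT time FROM test ORDER BY time").fetchall()
print(len(rows))  # 3 -- one row per timestamp remains
```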
jjanes

There's this sledgehammer approach:

begin;
create temp table a on commit drop as
    select distinct on (time) * from test order by time;
truncate test;
insert into test select * from a;
commit;
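
A rough SQLite sketch of the same rebuild idea (assumptions throughout: SQLite has neither DISTINCT ON nor TRUNCATE, so a MIN(rowid) selection and a plain DELETE stand in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    time TEXT, bid_volume INTEGER, bid_price REAL,
    ask_price REAL, ask_volume INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?,?,?,?,?)", [
    ("2021-02-01 08:00:00.123457+00",  99, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 76),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:00:01.1227+00",   100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:01:01.126343+00",  80, 1257.5,  1256.5, 76),
])

# Copy one row per timestamp into a temp table, empty the original,
# then copy the deduplicated rows back.
conn.execute("""CREATE TEMP TABLE a AS
    SELECT * FROM test
    WHERE rowid IN (SELECT MIN(rowid) FROM test GROUP BY time)""")
conn.execute("DELETE FROM test")
conn.execute("INSERT INTO test SELECT * FROM a")

count = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(count)  # 3
```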

Or using ctid to find the records to purge.

with a as ( select time ,min(ctid) from test group by time) 
delete from test using a 
where test.time=a.time and test.ctid > a.min;
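
The same keep-the-first-row logic can be sketched in SQLite with `rowid` in place of `ctid` (an analogue; SQLite's DELETE lacks USING, so NOT IN expresses the join):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    time TEXT, bid_volume INTEGER, bid_price REAL,
    ask_price REAL, ask_volume INTEGER)""")
conn.executemany("INSERT INTO test VALUES (?,?,?,?,?)", [
    ("2021-02-01 08:00:00.123457+00",  99, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 75),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.25, 1256.5, 76),
    ("2021-02-01 08:00:00.123457+00", 100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:00:01.1227+00",   100, 1256.5,  1256.5, 76),
    ("2021-02-01 08:01:01.126343+00",  80, 1257.5,  1256.5, 76),
])

# Keep only the earliest physical row (smallest rowid) per timestamp.
conn.execute("""
    DELETE FROM test WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM test GROUP BY time)""")

kept = conn.execute("SELECT bid_volume FROM test ORDER BY time").fetchall()
print([v for (v,) in kept])  # [99, 100, 80] -- first row per timestamp kept
```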

Or the same with a subquery instead of a CTE.

delete from test
using (select time, min(ctid) from test group by time) as a
where test.time = a.time and test.ctid > a.min;
Jasen