I have a table without an id or primary key field:
CREATE TABLE IF NOT EXISTS test
(
time TIMESTAMPTZ NOT NULL,
bid_volume INTEGER,
bid_price DOUBLE PRECISION,
ask_price DOUBLE PRECISION,
ask_volume INTEGER
);
I have a number of rows which have duplicate timestamps:
# SELECT * FROM test ORDER BY time;
time | bid_volume | bid_price | ask_price | ask_volume
-------------------------------+------------+-----------+-----------+------------
2021-02-01 08:00:00.123457+00 | 99 | 1256.25 | 1256.5 | 75
2021-02-01 08:00:00.123457+00 | 100 | 1256.25 | 1256.5 | 75
2021-02-01 08:00:00.123457+00 | 100 | 1256.25 | 1256.5 | 76
2021-02-01 08:00:00.123457+00 | 100 | 1256.5 | 1256.5 | 76
2021-02-01 08:00:01.1227+00 | 100 | 1256.5 | 1256.5 | 76
2021-02-01 08:01:01.126343+00 | 80 | 1257.5 | 1256.5 | 76
(6 rows)
I would like to delete all but one of the duplicate timestamps.
My investigations thus far lead me to believe I can use ROW_NUMBER()
with a PARTITION
, and delete all rows with rn
> 1
test=# SELECT ROW_NUMBER() OVER (PARTITION BY time) AS rn, * FROM test;
rn | time | bid_volume | bid_price | ask_price | ask_volume
----+-------------------------------+------------+-----------+-----------+------------
1 | 2021-02-01 08:00:00.123457+00 | 99 | 1256.25 | 1256.5 | 75
2 | 2021-02-01 08:00:00.123457+00 | 100 | 1256.25 | 1256.5 | 75
3 | 2021-02-01 08:00:00.123457+00 | 100 | 1256.25 | 1256.5 | 76
4 | 2021-02-01 08:00:00.123457+00 | 100 | 1256.5 | 1256.5 | 76
1 | 2021-02-01 08:00:01.1227+00 | 100 | 1256.5 | 1256.5 | 76
1 | 2021-02-01 08:01:01.126343+00 | 80 | 1257.5 | 1256.5 | 76
There are several questions on SO which seem to be related:
- deleting-all-duplicate-rows-but-keeping-one
- how-can-i-remove-duplicate-rows
- delete-using-cte-slower-than-using-temp-table-in-postgres
However, all these questions use an additional id
field.
Without an additional id
field I am unable to keep one of the rows:
# DELETE FROM test del
USING
(
SELECT
time,
row_number() OVER (PARTITION BY time) AS rn
FROM test
) sub
WHERE
sub.time = del.time
AND
sub.rn > 1;
This deletes all duplicate rows (ie: the sub.rn > 1
doesn't seem to have any effect)
# SELECT * FROM test ORDER BY time;
time | bid_volume | bid_price | ask_price | ask_volume
-------------------------------+------------+-----------+-----------+------------
2021-02-01 08:00:01.1227+00 | 100 | 1256.5 | 1256.5 | 76
2021-02-01 08:01:01.126343+00 | 80 | 1257.5 | 1256.5 | 76
Notice timestamp 2021-02-01 08:00:00.123457+00
is gone!
Dropping the sub.time = del.time
clause causes all the rows to be deleted! (again, sub.rn
doesn't seem to have any effect)
# DELETE FROM test del
USING
(
SELECT
time,
row_number() OVER (PARTITION BY time) AS rn
FROM test
) sub
WHERE
sub.rn > 1;
DELETE 6
<-- Oh no!
How can I delete the duplicates from my table, keeping one?