
I have a table like:

CREATE TABLE meta.fk_payment1
(
  id serial NOT NULL,
  settlement_ref_no character varying,
  order_type character varying,
  fulfilment_type character varying,
  seller_sku character varying,
  wsn character varying,
  order_id character varying,
  order_item_id bigint,
   ....
  );

I am inserting data from a CSV file whose columns all match the table's columns, except for the id column.

When the CSV file is uploaded more than once, the data is duplicated,

but the id is not, and id is the primary key.

So I want to remove all duplicate rows without using the primary key.

I have to do this on a single table.

  • Create copy table, do insert into newtab select distinct from oldtab. – jarlh May 11 '15 at 07:02
  • I need to do this on single table –  May 11 '15 at 07:03
  • I'm a bit confused about the "id is primary key", but "remove ... without using primary key". Do you mean there is no primary key in the csv, but in the database there is one? Should edit the question to be a bit clearer about this. – Sami Kuhmonen May 11 '15 at 07:13
  • Duplicate rows based on which columns? Only the `id` column? And which "duplicates" do you want to keep? Does it matter? Is there some kind of timestamp column that identifies that "latest" row that you want to keep? –  May 11 '15 at 07:15
  • This has been asked many times: http://stackoverflow.com/search?q=[postgresql]+delete+duplicates specifically: http://stackoverflow.com/questions/29409184/delete-duplicate-rows-from-table-with-no-unique-key –  May 11 '15 at 07:45

5 Answers


You can do it like this, e.g.:

DELETE FROM table_name
    WHERE   ctid NOT IN
    (SELECT     MAX(dt.ctid)
        FROM        table_name As dt
        GROUP BY    dt.*);

For your table, run this query:

DELETE FROM meta.fk_payment1
    WHERE   ctid NOT IN
    (SELECT     MAX(dt.ctid)
        FROM        meta.fk_payment1 As dt
        GROUP BY    dt.*);
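
The idea behind this query can be sketched outside PostgreSQL as well. Below is a minimal, hypothetical demo using Python's `sqlite3`, where SQLite's built-in `rowid` plays the role that `ctid` plays in PostgreSQL (a physical row identifier that differs even for otherwise-identical rows). The table and sample data are invented for illustration; note that `GROUP BY dt.*` is PostgreSQL-specific, so the columns are listed explicitly here.

```python
import sqlite3

# Hypothetical sample table: SQLite's rowid stands in for PostgreSQL's ctid.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (order_id TEXT, seller_sku TEXT)")
rows = [("A1", "SKU1"), ("A1", "SKU1"), ("A2", "SKU2")]  # one duplicate
conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)

# Keep only the row with the highest rowid in each group of identical rows.
# (GROUP BY dt.* is PostgreSQL-only; list the columns explicitly in SQLite.)
conn.execute("""
    DELETE FROM payments
    WHERE rowid NOT IN (
        SELECT MAX(dt.rowid)
        FROM payments AS dt
        GROUP BY dt.order_id, dt.seller_sku
    )
""")
remaining = conn.execute(
    "SELECT order_id, seller_sku FROM payments ORDER BY order_id").fetchall()
print(remaining)  # [('A1', 'SKU1'), ('A2', 'SKU2')]
```

One duplicate of `('A1', 'SKU1')` is removed; one copy of each distinct row survives.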
Shubham Batra

Copy the distinct data to a work table, fk_payment1_copy. The simplest way to do that is to use SELECT ... INTO:

SELECT max(id),settlement_ref_no ... 
INTO fk_payment1_copy
from fk_payment1
GROUP BY settlement_ref_no ... 

Delete all rows from fk_payment1:

delete from fk_payment1

and copy the data from the fk_payment1_copy table back into fk_payment1:

insert into fk_payment1
select id,settlement_ref_no ... 
from fk_payment1_copy
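
The three steps above can be sketched end to end. This is a hypothetical demo in Python's `sqlite3` with an invented two-column table; SQLite has no `SELECT ... INTO`, so `CREATE TABLE ... AS SELECT` is used as its equivalent.

```python
import sqlite3

# Hypothetical table with duplicate data rows under distinct ids.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fk_payment1
                (id INTEGER PRIMARY KEY, settlement_ref_no TEXT)""")
conn.executemany("INSERT INTO fk_payment1 VALUES (?, ?)",
                 [(1, "S1"), (2, "S1"), (3, "S2")])

conn.executescript("""
    -- step 1: keep one id per group of duplicate data columns
    CREATE TABLE fk_payment1_copy AS
    SELECT MAX(id) AS id, settlement_ref_no
    FROM fk_payment1
    GROUP BY settlement_ref_no;

    -- step 2: clear the original table
    DELETE FROM fk_payment1;

    -- step 3: copy the de-duplicated rows back
    INSERT INTO fk_payment1
    SELECT id, settlement_ref_no FROM fk_payment1_copy;

    DROP TABLE fk_payment1_copy;
""")
result = conn.execute(
    "SELECT id, settlement_ref_no FROM fk_payment1 ORDER BY id").fetchall()
print(result)  # [(2, 'S1'), (3, 'S2')]
```

Because the work table keeps MAX(id) per group, the surviving ids are copied back unchanged rather than regenerated.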
Robert
  • @Shubhambatra I'm not sure if there is simplest way. You can't create new table, can you? – Robert May 11 '15 at 07:14
  • I can create but in future if any column become foreign key then it is not to able delete from fk_payment1 . Is it? –  May 11 '15 at 10:13
  • @Shubhambatra Right it maybe more complicated. I suggest you to import CSV to work table than copy it to destination with distinct. – Robert May 11 '15 at 10:16
  • while i try to remove duplicate row from work table then your method is also changed the id that i dont want –  May 16 '15 at 16:52
  • @Shubhambatra Id is copied so can't be changed. – Robert May 18 '15 at 06:31
  • if i use SELECT distinct id,settlement_ref_no ... INTO fk_payment1_copy from fk_payment1 then it always give all row becoz id is always different –  May 18 '15 at 06:37
  • @Shubhambatra Now I see the problem, I changed 1st query, try it now please. – Robert May 18 '15 at 06:42

If the table isn't very large, you can do:

-- create temporary table and select distinct into it. 
CREATE TEMP TABLE tmp_table AS 
SELECT DISTINCT column_1, column_2 
FROM original_table ORDER BY column_1, column_2;

-- clear the original table 
TRUNCATE original_table;

-- copy data back in again 
INSERT INTO original_table(column_1, column_2) 
SELECT * FROM  tmp_table ORDER BY column_1, column_2;

-- clean up
DROP TABLE tmp_table;
  • For larger tables, remove the TEMP keyword from the tmp_table creation.
  • This solution comes in handy when working with JPA (Hibernate) produced @ElementCollection tables, which are created without a primary key.
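
The same round trip can be tried in a scratch environment. Here is a hypothetical demo via Python's `sqlite3` with invented column data; SQLite has no `TRUNCATE`, so a plain `DELETE` clears the table instead.

```python
import sqlite3

# Hypothetical original_table with one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE original_table (column_1 TEXT, column_2 TEXT)")
conn.executemany("INSERT INTO original_table VALUES (?, ?)",
                 [("a", "x"), ("a", "x"), ("b", "y")])

conn.executescript("""
    -- create temporary table and select distinct into it
    CREATE TEMP TABLE tmp_table AS
    SELECT DISTINCT column_1, column_2
    FROM original_table ORDER BY column_1, column_2;

    -- clear the original table (SQLite's stand-in for TRUNCATE)
    DELETE FROM original_table;

    -- copy data back in again
    INSERT INTO original_table (column_1, column_2)
    SELECT * FROM tmp_table ORDER BY column_1, column_2;

    -- clean up
    DROP TABLE tmp_table;
""")
deduped = conn.execute("SELECT * FROM original_table").fetchall()
print(deduped)  # [('a', 'x'), ('b', 'y')]
```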
Ilan.K

A bit unsure about the primary key part in the question, but in any case id doesn't need to be a primary key, it just needs to be unique, as it should be since it's serial. So if it has unique values, you can do it this way:

DELETE FROM fk_payment1 f WHERE EXISTS
  (SELECT * FROM fk_payment1 WHERE id<f.id
   AND settlement_ref_no=f.settlement_ref_no
   AND ...)

You just need to add all the columns to the subquery. This way, every row that has the same values (except id) as an earlier row (ordered by id) will be deleted, keeping only the row with the smallest id in each group.
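
As a sanity check, here is a hypothetical miniature of this correlated-EXISTS delete using Python's `sqlite3`, with one invented data column in place of the elided list. SQLite doesn't allow aliasing the DELETE target (the `f` above), so the outer table is referenced by its full name instead.

```python
import sqlite3

# Hypothetical table: ids 1 and 2 carry identical data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fk_payment1
                (id INTEGER PRIMARY KEY, settlement_ref_no TEXT)""")
conn.executemany("INSERT INTO fk_payment1 VALUES (?, ?)",
                 [(1, "S1"), (2, "S1"), (3, "S2")])

# Delete any row for which an identical row with a smaller id exists.
conn.execute("""
    DELETE FROM fk_payment1 WHERE EXISTS
      (SELECT * FROM fk_payment1 AS d WHERE d.id < fk_payment1.id
       AND d.settlement_ref_no = fk_payment1.settlement_ref_no)
""")
kept = conn.execute(
    "SELECT id, settlement_ref_no FROM fk_payment1 ORDER BY id").fetchall()
print(kept)  # [(1, 'S1'), (3, 'S2')]
```

The duplicate with the larger id (2) is removed; the earliest copy of each row survives.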

(Also, naming a table with fk_ prefix makes it look like a foreign key.)

Sami Kuhmonen
  • I use this but its not giving exact output . its missing some row . –  May 11 '15 at 10:11

There is a slick way right in the PostgreSQL wiki: https://wiki.postgresql.org/wiki/Deleting_duplicates

This query deletes the duplicates among all rows of tablename that share the same column1, column2, and column3, keeping the row with the lowest id in each group:

DELETE FROM tablename
WHERE id IN (SELECT id
              FROM (SELECT id,
                             ROW_NUMBER() OVER (partition BY column1, column2, column3 ORDER BY id) AS rnum
                     FROM tablename) t
              WHERE t.rnum > 1);

I tested this de-duplicating 600k rows down to 200k unique rows. The solution using GROUP BY and NOT IN took 3h+; this one takes about 3s.
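
A quick, hypothetical check of the keep-one semantics, using Python's `sqlite3` with invented data (SQLite supports ROW_NUMBER() since version 3.25, which modern Python builds bundle):

```python
import sqlite3

# Hypothetical table: three identical ('a', 'x') rows and one distinct row.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tablename
                (id INTEGER PRIMARY KEY, column1 TEXT, column2 TEXT)""")
conn.executemany("INSERT INTO tablename VALUES (?, ?, ?)",
                 [(1, "a", "x"), (2, "a", "x"), (3, "a", "x"), (4, "b", "y")])

# Number each row within its duplicate group; delete everything after the first.
conn.execute("""
    DELETE FROM tablename
    WHERE id IN (SELECT id
                 FROM (SELECT id,
                              ROW_NUMBER() OVER
                                (PARTITION BY column1, column2 ORDER BY id) AS rnum
                       FROM tablename) t
                 WHERE t.rnum > 1)
""")
survivors = conn.execute("SELECT id FROM tablename ORDER BY id").fetchall()
print(survivors)  # [(1,), (4,)]
```

Because rows with rnum = 1 are never selected for deletion, exactly one row per duplicate group (the one with the lowest id) remains.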

Thomas Rollet
Ladas
  • if I am not mistaken, this deletes all copies of the duplicate instead of leaving one of the copies remaining. So if you have two rows with the same id, you will be left with zero rows with that id after running this. – DragonBobZ Dec 21 '19 at 20:11