4

I am having some difficulty removing duplicate rows. I thought user_id and time_id together would act as an identifier, but there are duplicates even for that pair.

user_id (text), time_id (bigint), value1 (numeric)

user_id | time_id | value1
--------|---------|-------
aaa     | 1       | 3
aaa     | 1       | 3
aaa     | 2       | 4
baa     | 3       | 1

In this case, how do I remove the duplicates? time_id has 16 distinct values and user_id has about 15,000, so I tried something like the query below, but I do not have a unique id.

    DELETE FROM tablename a
    USING tablename b
    WHERE a.unique_id < b.unique_id
    AND   a.user_id = b.user_id
    AND   a.time_id = 1;   -- repeat till time_id 16
Paul Maxwell
no_name

3 Answers

13

Each table in Postgres has a few hidden system columns. One of them, ctid, is unique by definition and can be used when a primary key is missing.

DELETE FROM tablename a
USING tablename b
WHERE a.ctid < b.ctid
AND a.user_id = b.user_id
AND a.time_id = b.time_id;

The root problem is the lack of a primary key. Using hidden system columns should not become a routine practice (see the comments below). Once you have deleted the duplicates, you should create a primary key on (user_id, time_id), or add a new unique column for that purpose.
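
For example, once the duplicates are gone, something along these lines should work (a sketch, assuming (user_id, time_id) is then unique in tablename):

    -- enforce uniqueness so the duplicates cannot reappear
    ALTER TABLE tablename ADD PRIMARY KEY (user_id, time_id);

    -- or, if a surrogate key is preferred instead:
    -- ALTER TABLE tablename ADD COLUMN id bigserial PRIMARY KEY;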

klin
  • Can be, but should generally not be. This possibly isn't the best advice for newbies, though it's perfectly valid. – Craig Ringer Oct 21 '17 at 12:47
  • It's simple, clear and *perfectly valid*, so why shouldn't it be used? – klin Oct 21 '17 at 12:58
  • Because `ctid` isn't really a public interface; it's a non-standard implementation detail. And nothing stops PostgreSQL from changing things in a future release in a way that breaks this query, either outright or, worse, subtly. – Craig Ringer Oct 21 '17 at 13:05
  • My tip is obviously casual, the table should of course have a primary key. As I remember you gave [similar advice in the past](https://stackoverflow.com/a/22339983/1995738). – klin Oct 21 '17 at 13:16
  • Yes, I have, and I should add a suitable warning there too. Thanks. It's not *bad*; it's just something I think new users should understand shouldn't be the first-preference solution. – Craig Ringer Oct 21 '17 at 13:19
  • Personally, I'd prefer an `EXISTS(...)` condition, and keep the row with the *lowest* `ctid`. – wildplasser Oct 21 '17 at 13:25
  • @wildplasser - I agree with you. I've used the OP's query to keep things simple. – klin Oct 21 '17 at 15:06
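
The EXISTS variant wildplasser describes might look something like this (a minimal sketch, still comparing ctid, but keeping the row with the *lowest* ctid in each group):

    DELETE FROM tablename del
    WHERE EXISTS (
        SELECT 1
        FROM  tablename x
        WHERE x.user_id = del.user_id
        AND   x.time_id = del.time_id
        AND   x.ctid < del.ctid   -- an earlier duplicate exists, so delete this row
    );
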
2

Please treat any advice on deletions with care and make sure you have a way to "undo it" if needed. I think you need to add an auto-numbered column to assist in this endeavor:

alter table tablename add column is_uniq serial

Then I'd suggest using row_number() to help identify the rows you do want to retain (where rn=1) and those to be deleted (where rn>1). Use the following as a guide:

select *
     , ROW_NUMBER() over (partition by user_id, time_id, value1 order by is_uniq) as rn
from tablename

I'm not sure if there are any other columns to use for the ORDER BY, but if there are, you can include them in the OVER clause as well.

Once you have the is_uniq column and can identify the rn > 1 rows, you should be able to safely delete the unwanted rows, as sketched below.
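
A sketch of that final delete, assuming the is_uniq column added above and the same partitioning:

    DELETE FROM tablename
    WHERE is_uniq IN (
        SELECT is_uniq
        FROM (
            SELECT is_uniq,
                   ROW_NUMBER() OVER (PARTITION BY user_id, time_id, value1
                                      ORDER BY is_uniq) AS rn
            FROM tablename
        ) numbered
        WHERE numbered.rn > 1   -- every row after the first in each duplicate group
    );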

Paul Maxwell
  • And *back up the table first*. Use `\copy tablename to 'tablename-before-delete.csv'` in `psql` as one easy way to do it. – Craig Ringer Oct 21 '17 at 12:46
0

If you don't want to rely on ctid (personally, I do), you can add a unique column (such as a serial) and use that for identity purposes:


CREATE TABLE lutser
        ( user_id text not null
        , time_i  integer not null
        , value   integer not null
        );

-- sample data, including one duplicate of ('aaa', 1)
INSERT INTO lutser (user_id, time_i, value) VALUES
  ('aaa', 1, 3)
, ('aaa', 1, 3)
, ('aaa', 2, 4)
, ('baa', 3, 1)
        ;

SELECT * FROM lutser;

-- add a surrogate key so the duplicates can be told apart
ALTER TABLE lutser
        ADD COLUMN seq serial NOT NULL UNIQUE
        ;

SELECT * FROM lutser;

-- keep only the row with the lowest seq for each (user_id, time_i)
DELETE FROM lutser del
WHERE EXISTS (
        SELECT * FROM lutser x
        WHERE x.user_id = del.user_id
        AND x.time_i = del.time_i
        AND x.seq < del.seq
        );

-- now the natural key can be enforced
ALTER TABLE lutser
        ADD PRIMARY KEY (user_id, time_i)
        ;

SELECT * FROM lutser;
wildplasser