
I have a very large table (several million records). Some records have duplicates (based on FieldA) where the only difference is the value in FieldB. I'd like to create a query that will remove all duplicate records based on FieldA, keeping the one with the lowest value in FieldB. Is this possible?

user3649739

2 Answers

CREATE TABLE TABLE1
  (
    FieldA VARCHAR2(30),
    FieldB VARCHAR2(30),
    FieldC VARCHAR2(30)
  );

INSERT INTO TABLE1 VALUES
  ('DUMMYDATA-A1','DUMMYDATA-B1','DUMMYDATA-C1'
  );
INSERT INTO TABLE1 VALUES
  ('DUMMYDATA-A1','DUMMYDATA-B4','DUMMYDATA-C1'
  );
INSERT INTO TABLE1 VALUES
  ('DUMMYDATA-A1','DUMMYDATA-B3','DUMMYDATA-C1'
  );
INSERT INTO TABLE1 VALUES
  ('DUMMYDATA-A1','DUMMYDATA-B2','DUMMYDATA-C1'
  );
COMMIT;

SELECT FieldA,
  FieldB,
  FieldC,
  RANK() OVER( PARTITION BY FieldA ORDER BY FieldB ASC) AS COLUMN_ALIAS
FROM TABLE1; --IDENTIFIES DUPLICATES BASED ON RANK VALUE

-- PERFORM DELETE: remove every duplicate, keeping the row with the
-- lowest FieldB (rank 1) within each FieldA group
DELETE
FROM TABLE1
WHERE ROWID IN
  (SELECT ROWID
  FROM
    (SELECT ROWID,
      RANK() OVER( PARTITION BY FieldA ORDER BY FieldB ASC) AS COLUMN_ALIAS
    FROM TABLE1
    )
  WHERE COLUMN_ALIAS > 1 -- rank 1 is kept; everything else is a duplicate
  );

COMMIT;

SELECT * FROM TABLE1; -- CONTAINS A SINGLE RECORD

The RANK function identifies the duplicate records and lets you delete only the duplicates, retaining the original row. This has already been discussed here: Deleting duplicates rows from oracle. Hope this helps.

However, as the DELETE itself is slow on a table of this size (millions of records), it is worth adding a proper constraint so duplicate entries are rejected at INSERT time in the first place.
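For example, once the existing duplicates have been removed, a unique constraint on FieldA would enforce this. A minimal sketch (the constraint name TABLE1_FIELDA_UK is just an illustrative choice):

-- Illustrative constraint name; assumes the duplicates have already been cleaned up.
-- Any later INSERT with an existing FieldA then fails with ORA-00001.
ALTER TABLE TABLE1 ADD CONSTRAINT TABLE1_FIELDA_UK UNIQUE (FieldA);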

Krishnendu

Seems pretty straightforward to extract these values:

select a,
       min(b) as b -- the lowest b for each a
from t
group by a;

Fiddle for example: http://sqlfiddle.com/#!9/bc4c9/3

You should be able to adapt a removal method from this.
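For example, one way to adapt it into a delete (a sketch in MySQL syntax to match the fiddle; t, a and b are the placeholder names from the query above, and keepers is just an illustrative alias) is to join against the per-key minimums and delete every row with a larger b:

-- For each a, keep the row with the lowest b and delete the rest.
-- Joining a derived table lets MySQL delete from t while reading from it.
delete t
from t
join (select a, min(b) as min_b
      from t
      group by a) keepers on t.a = keepers.a
where t.b > keepers.min_b;

Rows that tie on the minimum b all survive, which matches the behaviour of the RANK-based delete in the other answer.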

Hunter McMillen
  • I ran this using LIMIT 100 and it ran fine. When I ran it without the limit it maxed out the tmp file very quickly. Is there a way to run a query like this so it flushes tmp, or am I going to have to manually run it in batches, e.g. WHERE ID>0 AND ID<100? That will sort of suck as there are 1/2 million records or so. – user3649739 Apr 05 '15 at 13:34
  • Why are you writing to a temp file? I thought you wanted to delete records. – Hunter McMillen Apr 05 '15 at 13:38