0

I know that it is possible to remove duplicates from a table using the ROWID in the following example form:

DELETE FROM TABLE_NAME 
WHERE ROWID IN 
(SELECT
MIN(ROWID)
FROM TABLE_NAME
WHERE ...
GROUP BY COLUMN_NAME_1, COLUMN_NAME_2, etc. ) 

I wonder if this is possible to be done for entries appearing more than 2 times. If not are there any other methods to be done?

Whitebeard13
  • 411
  • 1
  • 7
  • 17
  • You shouldn't be using rowid, instead, consider using some unique or primary key instead. – Jorge Campos Aug 24 '18 at 12:18
  • If you don't have a primary key, then this is a viable approach. – Gordon Linoff Aug 24 '18 at 12:21
  • @Jorge Campos: I am new to SQL. Do you say that only for the triplicate/quadruplicate cases or also for the duplicates? When you say primary key can you provide and example? – Whitebeard13 Aug 24 '18 at 12:24
  • @Gordon Linoff. If I use this method for the cases that we have triplicates or quadruplicates, then I must run the above scrip multiple times because by running it only once, it removes only one entry and leaves the rest of the copies. – Whitebeard13 Aug 24 '18 at 12:26
  • A primary key is a column defined in your table that does not have duplicated values. Usually it is an ID column, show your table structure. – Jorge Campos Aug 24 '18 at 12:30
  • This is already answered [here](https://stackoverflow.com/questions/529098/removing-duplicate-rows-from-table-in-oracle) – tbone Aug 24 '18 at 12:39
  • The query you show is incorrect. It doesn't only "remove duplicates", it also removes all the rows that are unique by `COLUMN_NAME_1, COLUMN_NAME_2, etc.` How do you "know" that you can remove duplicates this way, did you see it suggested somewhere? It is 100% wrong. –  Aug 24 '18 at 13:30
  • Now I am wondering if we are understanding your question correctly. Something crossed my mind. Do you want to keep the rows that are "unique" by the values in GROUP BY, and also those that are just "duplicates", but if they are "triplicate or more" do you want to reduce the number to TWO, not to ONE? So, if there were five copies of the same `COLUMN_NAME_1, ...`, keep just TWO of them? That would be a more interesting question! –  Aug 24 '18 at 13:34
  • @mathguy The `WHERE` statement is the one filtering the rows for which we have more than one entries. – Whitebeard13 Aug 24 '18 at 13:36
  • @mathguy: I want to keep the "unique" and for cases of duplicates, triplicates, quadruplicates, to keep just one row of them – Whitebeard13 Aug 24 '18 at 13:38
  • @Whitebeard13 - perhaps that is possible, but it makes no sense. The proper way to "filter" is with a `HAVING` condition (which operates on the groups), for example `HAVING count(*) > 1` - not a `WHERE` condition, which operates on each input row in isolation, not "as a group". –  Aug 24 '18 at 14:01

3 Answers3

1

if you want preserve the most recent rowid

you could use a NOT IN for max(rowid) group by all the columns

  DELETE FROM TABLE_NAME 
  WHERE ROWID NOT IN 
  (SELECT
  MAX(ROWID)
  FROM TABLE_NAME
  WHERE...
  GROUP BY COLUMN_NAME_1, COLUMN_NAME_2, etc. )
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
  • But that would delete all the rest of the rows also right? – Whitebeard13 Aug 24 '18 at 12:40
  • these should delete all the rows that have more then a rows .. (the row preserved for each all column group by is the row with max(rowid) ..but you can use MIN(rowid ) if you want preserve the older ) anyway al least one row for each group column is not deleted .. ... make a copy before try anyway or better work on a copy .. for test .. – ScaisEdge Aug 24 '18 at 12:49
  • By applying your method I retrieve only the entries for which I have triplicates or quadruplicates. – Whitebeard13 Aug 24 '18 at 12:57
  • @Whitebeard13 - What do you mean by that? scaisEdge's method does not **retrieve** anything, it **deletes** rows from the table. It should delete duplicates too, not only triplicates etc. –  Aug 24 '18 at 13:28
  • It indeed deletes duplicates and triplicates but also the rest of the rows keeping only those MAX(ROWID)s. – Whitebeard13 Aug 24 '18 at 13:31
  • each rows group by your column have one row with max(row) so are delete only duplicated , triplicated .. and so on but one rows for each type remain .. the single rows remain because max(rowid) is the same of the rowid of the single row .. try – ScaisEdge Aug 24 '18 at 13:34
1

I wouldn't use not in or even min or max. I usually use something like

delete demo where rowid in
( select lead(rowid) over(partition by your_grouping_column order by col1, col2)
  from demo )

lead(rowid) returns all the 'next' rowids in order of col1, col2. The first row for each your_grouping_column (whatever that is) won't be in that list.

William Robertson
  • 15,273
  • 4
  • 38
  • 44
0

I don't know if the following is an efficient answer. Please let me know if so.

Assuming that the Table_V1 is the one including the duplicates, I created a second version of it which includes the count of duplicates next to the original data

CREATE TABLE Table_V2 AS ( SELECT A.*, B.CNT FROM Table_V1 A LEFT JOIN (SELECT * FROM (SELECT COL1, COL2, COUNT(COL1) AS CNT FROM Table_V1 GROUP BY COL1, COL2) WHERE CNT>1) B ON A.COL1 = B.COL1 AND A.COL2 = B.COL2);

Then I delete the duplicates/triplicates/quadruplicates, etc., using the following lines:

DELETE FROM Table_V2 WHERE CNT IS NOT NULL AND ROWID NOT IN ( SELECT MAX (ROWID) FROM Table_V2 WHERE CNT IS NOT NULL GROUP BY COL1, COL2);

Whitebeard13
  • 411
  • 1
  • 7
  • 17