0

Please consider following table

Table Name: mytable

model_id event_name time_of_event
9 CREATE 2016-01-01 00:00:00
9 UPDATE 2016-01-01 01:00:00
9 DELETE 2016-01-01 02:00:00
3 CREATE 2016-01-01 03:00:00       DUPLICATE
3 CREATE 2016-01-01 03:00:00       DUPLICATE delete this
3 DELETE 2016-01-01 04:00:00

How to delete 5th entry from above table i.e. delete row with exactly same value from table. In above example no column is unique.

Please keep in mind that database could be huge and I don't want to recreate or republish data with distinct values into the table.

// Use below code to create above example 

CREATE TABLE mytable(
  model_id  integer,
  event_name varchar(7),
  time_of_event timestamp
);

INSERT INTO mytable 
    (model_id, event_name, time_of_event) 
    VALUES
    (9, 'CREATE', '2016-01-01 00:00:00'),
    (9, 'UPDATE', '2016-01-01 01:00:00'),
    (9, 'DELETE', '2016-01-01 02:00:00'), 
    (3, 'CREATE', '2016-01-01 03:00:00'),
    (3, 'CREATE', '2016-01-01 03:00:00'),
    (3, 'DELETE', '2016-01-01 04:00:00');
    
SELECT * FROM mytable;
afzalex
  • 8,598
  • 2
  • 34
  • 61
  • 1
    I removed the conflicting DBMS tags. Please add only **one** tag for the database product you are really using. –  May 08 '22 at 13:07
  • 1
    Does this answer your question? https://stackoverflow.com/questions/26769454/how-to-delete-duplicate-rows-without-unique-identifier – alchemist May 08 '22 at 13:10
  • What DBMS are you using? – Zakk May 08 '22 at 13:16
  • @alchemist Question's title is perfect, but in that question `id` column is unique, and answerers are using that fact. If somebody knows answer, you can post your answer there, and mark this as duplicate. – afzalex May 08 '22 at 13:19
  • @afzalex _"but in that question id column is unique"_ Whaaatt? How did you end up in this situation then? – Zakk May 08 '22 at 13:20
  • @Zakk In perfect world all table should have unique identifier column, but table with no unique columns could be created, and values could be added in it. – afzalex May 08 '22 at 13:22
  • @afzalex _"In perfect world all table should have unique identifier column"_ NO! Not in perfect world. In **real world**, all tables should have a unique identifier column. – Zakk May 08 '22 at 13:25
  • @Zakk The question is, how to delete duplicate row from table if no column is unique? (table with no unique column could be created in **real world**). In such situation how will you delete those entries? – afzalex May 08 '22 at 13:28
  • @afzalex I will try to provide an answer. – Zakk May 08 '22 at 13:29
  • @afzalex I don't think there is a way can make it without unique contain or PK and limit `I don't want to recreate or republish data with distinct values into the table.` in mysql – D-Shih May 08 '22 at 14:45

2 Answers2

1

Try with a helper table that contains the duplicates, but only one each:

With this scenario: ...

CREATE TABLE mytable(
model_id integer,event_name VARCHAR(8),time_of_event TIMESTAMP)
; 

INSERT INTO mytable
-- your input data ...
          SELECT 9,'CREATE',TIMESTAMP '2016-01-01 00:00:00'
UNION ALL SELECT 9,'UPDATE',TIMESTAMP '2016-01-01 01:00:00'
UNION ALL SELECT 9,'DELETE',TIMESTAMP '2016-01-01 02:00:00'
UNION ALL SELECT 3,'CREATE',TIMESTAMP '2016-01-01 03:00:00'
UNION ALL SELECT 3,'CREATE',TIMESTAMP '2016-01-01 03:00:00'
UNION ALL SELECT 3,'DELETE',TIMESTAMP '2016-01-01 04:00:00'
;

Create your helper table like so:

CREATE TABLE helper AS
SELECT
  model_id
, event_name
, time_of_event
FROM mytable
GROUP BY 
  model_id
, event_name
, time_of_event
HAVING COUNT(*) > 1;

Then, use the helper table to delete ... you will delete all rows, not only one of the duplicates ...

DELETE FROM mytable
WHERE(model_id,event_name,time_of_event) IN (
SELECT model_id,event_name,time_of_event FROM helper
);

And finally, insert all the rows from the helper table back in again:

INSERT INTO mytable
SELECT * FROM helper;                                                                                                                                                                              
COMMIT; -- if your connection is not auto-commit ...

But I'd like to add that, for most database systems, the other approach - to create a new table containing SELECT DISTINCT * FROM old_table is the faster alternative as soon as we are talking about around 20 to 25 % of the total row count.

marcothesane
  • 6,192
  • 1
  • 11
  • 21
1

Having two or more rows with identical values is a sign of very bad design. I suppose model_id is the table's primary key. I wonder how did you end up in this situation.

I don't want to recreate or republish data with distinct values into the table.

One possible solution is to add (not recreate/republish) a column with unique values to your table, then delete the duplicate rows you want.

ALTER TABLE mytable ADD COLUMN MyTableID INT FIRST;

You need to fill this column with unique values:

SET @i := 0;
UPDATE MyTable SET MyTableID = @i:=(@i+1) WHERE 1=1;

Next, you can write the following query:

SELECT
  MT.MyTableID, MT.model_id, MT.event_name, MT.time_of_event
FROM
  MyTable MT,
  (SELECT model_id, event_name, time_of_event, COUNT(*)
   FROM MyTable
   GROUP BY model_id, event_name, time_of_event
   HAVING COUNT(*) > 1
  ) TmpTable
WHERE 
  MT.model_id = TmpTable.model_id
  AND MT.event_name = TmpTable.event_name
  AND MT.time_of_event = TmpTable.time_of_event
;

Result:

MyTableID model_id event_name time_of_event
4 3 CREATE 2016-01-01 03:00:00
5 3 CREATE 2016-01-01 03:00:00

You can now proceed with the deletion of duplicate rows:

DELETE FROM MyTable WHERE MyTableID IN (5 /*, the ones you wish */);

If you have too many duplicate values and you can't afford to delete them manually, you can do it like this:

DELETE FROM MyTable WHERE MyTableID IN (
  SELECT
    MT.MyTableID
  FROM
    (SELECT * FROM MyTable) AS MT,
    (SELECT model_id, event_name, time_of_event, COUNT(*)
     FROM MyTable
     GROUP BY model_id, event_name, time_of_event
     HAVING COUNT(*) > 1
    ) AS TmpTable
  WHERE 
    MT.model_id = TmpTable.model_id
    AND MT.event_name = TmpTable.event_name
    AND MT.time_of_event = TmpTable.time_of_event
) LIMIT /* The number of duplicate rows - 1 */;

The -1 is to preserve one row of the duplicates. If you want to delete them all, remove the LIMIT clause.

Zakk
  • 1,935
  • 1
  • 6
  • 17
  • I ended up in this situation because it is asked to me by an interviewer. when I said about using temporary table, his next question was, what if table is really huge. I knew it could be done somehow with row numbers, but don't know the solution. – afzalex May 08 '22 at 19:13
  • @afzalex _"I knew it could be done somehow with row numbers"_ This solution is RDBMS-dependant. Mine is not. – Zakk May 08 '22 at 19:37
  • @afzalex But does this answer your question? i.e. without creating a temporary table? – Zakk May 08 '22 at 19:41