
This is a follow-up to my original question posted here.

My setup:
Amazon RDS, accessed through MySQL Workbench with the connection timeout set to the maximum.

I am trying to DELETE duplicate rows from my DB, which has close to 1 million rows.

The table looks like this; `mytext` is a MEDIUMTEXT column and `id` is AUTO_INCREMENT:

+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc  | 123   |
| 2 | joe | min | abc  | 123   |
| 3 | mar | kam | def  | 789   |
| 4 | kel | smi | ghi  | 456   |
+---+-----+-----+------+-------+

I would like to end up with a table like this:

+---+-----+-----+------+-------+
|id |fname|lname|mytext|morevar|
|---|-----|-----|------|-------|
| 1 | joe | min | abc  | 123   |
| 3 | mar | kam | def  | 789   |
| 4 | kel | smi | ghi  | 456   |
+---+-----+-----+------+-------+

This solution started working, but after about 10,000 rows the process slows down and eventually hangs.

I let this run for over 20 hours with the chunk size set to 10,000 rows via a WHERE condition (I thought deleting in chunks would be safer).

But even with the WHERE clause the system hangs, and I then have to reboot RDS to access the DB.

DELETE
FROM yourTable
WHERE id >  40000
  AND id <= 50000
  AND id NOT IN
(
    -- the extra level of nesting materializes the subquery,
    -- which sidesteps MySQL error 1093 on the target table
    SELECT MAXID FROM
    (
        SELECT MAX(id) AS MAXID
        FROM yourTable
        GROUP BY mytext
    ) AS temp_table
)

Here's the CREATE statement:

CREATE TABLE `yourTable` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `fname` varchar(45) DEFAULT NULL,
    `lname` varchar(45) DEFAULT NULL,
    `mytext` mediumtext,
    `morevar` bigint(20) DEFAULT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;

Question
Is this SQL command OK for handling large numbers of rows and for what I am trying to achieve, or is there a better solution?

How long would it normally take to process 1 million rows?

Is there a setting inside Amazon RDS, like the limits in php.ini, for large data set manipulation?

Or would it make more sense to create a new table and insert all rows excluding duplicates?
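For reference, a minimal sketch of that rebuild approach, keeping the lowest `id` per `mytext` value to match the desired output above (the `yourTable_dedup` name is hypothetical):

CREATE TABLE yourTable_dedup LIKE yourTable;

-- Keep the row with the lowest id for each mytext value (ids 1, 3, 4 above).
INSERT INTO yourTable_dedup
SELECT t.*
FROM yourTable AS t
JOIN (
    SELECT MIN(id) AS id
    FROM yourTable
    GROUP BY mytext
) AS keepers ON keepers.id = t.id;

-- Swap the tables once the copy is verified.
RENAME TABLE yourTable TO yourTable_old, yourTable_dedup TO yourTable;

Because the duplicates are never loaded into the new table, this avoids the long-running DELETE entirely; the trade-off is the temporary disk space for the copy.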

t q
  • What indexes do you have on that table? *(Also, note that your query will delete row 1, not row 2, though I suspect that's not relevant...)* – MatBailie Jul 10 '13 at 20:02
  • Are any of the non-ID fields indexed? If so, you could loop through those values to delete in chunks. That's not a particularly large table, but I'm not terribly familiar with MySQL – Hart CO Jul 10 '13 at 20:03
  • @GoatCO im not really sure what you mean by `indexed` – t q Jul 10 '13 at 20:12
  • @tq - An index is an RDBMS structure designed to make searching the data simpler. In your case you only have one index, and that's the `PRIMARY KEY`. What this means is that the only way to find the `MAX(id)` for a specific `myText` is to check every single row in the whole table. Which will be exceptionally slow. I'd recommend checking out how to create an Index and how they're used / relevant. – MatBailie Jul 10 '13 at 20:14

1 Answer


I really wouldn't use `NOT IN`.

I would ensure that there is an index on `(myText, id)` and then try this...

DELETE
FROM
  yourTable
WHERE
      id >  40000
  AND id <= 50000
  AND EXISTS (SELECT *
                FROM yourTable AS lookup
               WHERE lookup.myText = yourTable.myText
                 AND lookup.id    >  yourTable.id
             )

This way you only check the myText values that you are potentially deleting.

Whereas your sub-query will return ids for `myText` values that don't even appear in the range you are checking.
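One note on the index this answer relies on: `mytext` is MEDIUMTEXT, so MySQL requires a prefix length when indexing it. A minimal sketch (the 255-byte prefix is an assumption; with latin1 it stays under InnoDB's 767-byte key limit):

-- Prefix index on mytext plus id; 255 is an assumed prefix length.
-- Equality checks still compare the full column, but the index
-- narrows the candidate rows dramatically.
ALTER TABLE yourTable ADD INDEX idx_mytext_id (mytext(255), id);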

MatBailie
  • thank you, will try this. will this keep 1 copy of the duplicate? – t q Jul 10 '13 at 20:13
  • @tq - I made a typo. The `>=` should be `>`. This will delete all rows where another row exists for the same `myText` but with a higher `id`. Therefore, the row with the highest `id` for any `myText` will *not* be deleted. Do note, however, that this ***is*** predicated on you creating that index, which itself is going to be slow. – MatBailie Jul 10 '13 at 20:17
  • i seem to be getting an error `Error Code: 1093. You can't specify target table 'yourTable' for update in FROM clause` – t q Jul 10 '13 at 20:32
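Error 1093 is MySQL refusing to let a DELETE read from its own target table in a correlated subquery. A common workaround (a sketch, not from the original thread) is to rewrite the check as a multi-table DELETE with a self-join, which MySQL does allow:

-- Multi-table DELETE with a self-join; a row t1 is removed whenever a row t2
-- with the same myText and a higher id exists, so the highest id per myText
-- survives, matching the answer's semantics.
DELETE t1
FROM yourTable AS t1
JOIN yourTable AS t2
  ON  t2.myText = t1.myText
  AND t2.id     > t1.id
WHERE t1.id >  40000
  AND t1.id <= 50000;

To keep the *lowest* id per `myText` instead, as in the question's desired output, flip the comparison to `t2.id < t1.id`.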