12

I now have a table that has 604 000 row. I would like to drop 4000 random rows so my table will only contains 600 000 entries.

Would there be a quick way to do so ?

Thanks a lot.

gbn
  • 422,506
  • 82
  • 585
  • 676
silkAdmin
  • 4,640
  • 10
  • 52
  • 83

4 Answers4

22

In theory, this will be random and fast. In practise, it will be only fast:

DELETE FROM tableX
LIMIT 4000

This will be random but terribly slow, with 600K rows:

DELETE FROM tableX
ORDER BY RAND() 
LIMIT 4000

This won't be truly random (as there are usually gaps in the ids) and it may not even delete exactly 4000 rows (but a few less when there are many gaps) but it's probably faster than the previous.

An extra wrapping in a subquery is needed because the syntax for Delete from multiple tables does not allow LIMIT:

DELETE td
FROM
      tableX AS td
  JOIN 
      ( SELECT t.id
        FROM
             tableX AS t 
          CROSS JOIN
              ( SELECT MAX(id) AS maxid 
                FROM tableX
              ) AS m
          JOIN
              ( SELECT RAND() AS rndm
                FROM tableX AS tr
                LIMIT 5000
              ) AS r
            ON 
               t.id = CEIL( rndm * maxid )
        LIMIT 4000
      ) AS x
      ON 
         x.id = td.id 

Explain output (of the subquery, from a 400K rows table):

id           table        possible_keys  key_len     rows 
 select_type         type             key     ref         Extra
1 PRIMARY <derived2> system                             1
1 PRIMARY <derived3> ALL                             5000
1 PRIMARY t          eq_ref PRIMARY PRIMARY 4 func      1 Using where;Using index
3 DERIVED tr         index          PRIMARY 4      398681 Using index
2 DERIVED                                            Select tables optimized away
ypercubeᵀᴹ
  • 113,259
  • 19
  • 174
  • 235
  • Note that it may delete less than 4000 rows! – ypercubeᵀᴹ Jan 26 '12 at 12:10
  • I reckon that `ORDER BY ROUND(RAND(), )` would be faster, and the less n, the faster. Example, n=2, n=3. Obviously it would not be as much shuffled, but that can still help a great deal when speed is a concern. – Fabien Haddadi Jun 17 '21 at 10:07
  • @FabienHaddadi I don't see how it can be fast on a big table. If the table has 600k rows, it will still have to sort 600k values. Plus you have added 600k calls to ROUND() function ;) – ypercubeᵀᴹ Jun 17 '21 at 10:18
  • Because it's got less variety to sort out of. It creates groups of similar values with smaller numbers. I myself sort faster just apples and bananas, than 12 kinds of vegies. To be checked in terms of perf. – Fabien Haddadi Jun 17 '21 at 10:38
  • Such sorting may be faster with a specialized sort algorithm. I don't think the engine can possibly know that or have more than one or two general algorithms available for sorting. But testing it would be good. Let me know what you find! – ypercubeᵀᴹ Jun 17 '21 at 11:04
3
delete from yourTable limit 4000
juergen d
  • 201,996
  • 37
  • 293
  • 362
  • this is random. You cannot tell which rows will be deleted. – juergen d Jan 26 '12 at 10:37
  • Wouldn't mysql do it by native order in database? (eg. first 4000 lowest IDs) – Vyktor Jan 26 '12 at 10:39
  • Could be but you cannot rely on that. See here http://stackoverflow.com/questions/8746519/sql-what-is-the-default-order-by-of-queries – juergen d Jan 26 '12 at 10:43
  • This is likely to delete contiguous rows in MySQL because of MySQL and InnoDB quirks. There is less chance in other RDBMS. However, it isn't random *enough* to be "random". So -1 because this is likely to be predicatable – gbn Jan 26 '12 at 10:53
  • I think the OP just wants to delete a number of rows no matter which. And this query does exactly that. – juergen d Jan 26 '12 at 10:58
2

If I had to venture a guess:

DELETE FROM table where id = (SELECT id FROM table ORDER BY rand() LIMIT 1) LIMIT 10
Kristoffer Sall-Storgaard
  • 10,576
  • 5
  • 36
  • 46
0
DELETE FROM TABLE ORDER BY RAND() LIMIT 4000;

It'll take time though...

A quicker way to execute (not write the code!) may be 4000 separate deletes in a loop

DELETE FROM TABLE WHERE AssumedPKisInt = <ARandomNumber>

Of course, you need to ensure you don't try to delete non-existing or already-deleted rows.

gbn
  • 422,506
  • 82
  • 585
  • 676