
I have one huge table that I would like to make smaller. It has ~230 million rows.

Both columns are indexed. The structure is:

+-------------+--------+
| id_my_value | id_ref |
+-------------+--------+
|        YYYY |   XXXX |
+-------------+--------+

I need to remove the rows that have a particular "id_ref" value. I have tried the following:

sql = "SELECT id_ref FROM REFS"
cursor.execute(sql)
refs = cursor.fetchall()
limit = 1000
for current in refs:
    id_ref = current["id_ref"]
    sql = f"DELETE FROM MY_VALUES WHERE id_ref = {id_ref} LIMIT {limit}"
    while True:
        cursor.execute(sql)
        mydb.commit()
        if cursor.rowcount == 0:
            break

Regardless of the value I set for "limit", the query is tremendously slow:

DELETE FROM MY_VALUES WHERE id_ref = XXXX LIMIT 10;

I have also tried the other way around: select the `id_my_value` keys associated with a particular `id_ref`, then delete by primary key:

SELECT id_my_value FROM MY_VALUES WHERE id_ref = XXXX LIMIT 10
DELETE FROM MY_VALUES WHERE id_my_value = YYYY

Here is my EXPLAIN:

EXPLAIN DELETE FROM MY_VALUES WHERE id_ref = YYYY LIMIT 1000;

+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+
| id | select_type | table     | partitions | type  | possible_keys | key    | key_len | ref   | rows     | filtered | Extra       |
+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+
|  1 | DELETE      | MY_VALUES | NULL       | range | id_ref        | id_ref | 5       | const | 20647922 | 100.00   | Using where |
+----+-------------+-----------+------------+-------+---------------+--------+---------+-------+----------+----------+-------------+

It does use the right INDEX.

I would have no problem letting this operation run for days on the server.

  1. What is the right way to approach this "cleaning"?

EDIT

Here is the output from SHOW CREATE TABLE MY_VALUES

MY_VALUES | CREATE TABLE `MY_VALUES` (
  `id_my_value` int NOT NULL AUTO_INCREMENT,
  `id_document` int NOT NULL,
  `id_ref` int DEFAULT NULL,
  `value` mediumtext CHARACTER SET utf8 COLLATE utf8_spanish_ci,
  `weigth` int DEFAULT NULL,
  `id_analysis` int DEFAULT NULL,
  `url` text CHARACTER SET utf8 COLLATE utf8_spanish_ci,
  `domain` varchar(64) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
  `filetype` varchar(16) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
  `id_domain` int DEFAULT NULL,
  `id_city` int DEFAULT NULL,
  `city_name` varchar(32) CHARACTER SET utf8 COLLATE utf8_spanish_ci DEFAULT NULL,
  `is_hidden` tinyint NOT NULL DEFAULT '0',
  `id_company` int DEFAULT NULL,
  `is_hidden_by_user` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id_my_value`),
  KEY `id_ref` (`id_ref`),
  KEY `id_document` (`id_document`),
  KEY `id_analysis` (`id_analysis`),
  KEY `weigth` (`weigth`),
  KEY `id_domain` (`id_domain`),
  KEY `id_city` (`id_city`),
  KEY `id_company` (`id_company`),
  KEY `value` (`value`(15))
)

UPDATE

I just tried to remove one register:

DELETE FROM MY_VALUES WHERE id_my_value = 8

That operation takes "forever". To prevent a timeout, I followed this SO question, so I have set:

show variables like 'innodb_lock_wait_timeout';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| innodb_lock_wait_timeout | 100000 |
+--------------------------+--------+
Javi M
  • Your query will not run, as VALUES is a reserved word. Also, if you don't mind it running for days, why does it bother you that the delete query runs extremely slowly, whatever that means – nbk May 13 '21 at 12:38
  • Show us the output of `SHOW CREATE TABLE tablename`. Please [edit] your question. – O. Jones May 13 '21 at 13:20
  • @nbk, I have edited the question to avoid that confusion. The column is not named "VALUES". I just reflected it. – Javi M May 13 '21 at 13:20
  • Good, it irritated me. To your question: check with EXPLAIN whether your query uses an index on id_ref; if not, make one and see whether it is used – nbk May 13 '21 at 13:27
  • EXPLAIN DELETE FROM MY_VALUES WHERE id_ref = YYYY LIMIT 1000; | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+-------+---------------+------------+---------+-------+----------+----------+-------------+ | 1 | DELETE | MY_VALUES | NULL | range | id_ref | id_ref | 5 | const | 20647922 | 100.00 | Using where | It does use the right INDEX – Javi M May 13 '21 at 13:36
  • Do you have either column indexed? – Rick James May 13 '21 at 13:58
  • See this for several "fast" deletion techniques: http://mysql.rjweb.org/doc.php/deletebig – Rick James May 13 '21 at 13:59
  • Still need to see the output from `SHOW CREATE TABLE MY_VALUES`, Notice that you need to show all the columns and all the indexes. We can't help you if you don't show us the whole table. You can., of course, change the column names. Please [edit] your question. – O. Jones May 13 '21 at 15:04
  • @O.Jones, already edited. :) – Javi M May 13 '21 at 16:09

3 Answers

cursor.execute("SELECT MAX(id_my_value) FROM MY_VALUES")
max_id = cursor.fetchone()[0] or 0

a = 0
step = 1000
while a < max_id:
    b = a + step
    sql = f"DELETE FROM MY_VALUES WHERE id_my_value > {a} AND id_my_value <= {b}"
    cursor.execute(sql)
    mydb.commit()
    a = b
Tom

  • Please always add some explanation why this would work better – nbk May 13 '21 at 12:41
  • @Tom, this answer assumes the ids are consecutive, and sadly that is not the case. The ids are "random" numbers. – Javi M May 13 '21 at 13:18

First thing to try. Put this right after your second cursor.execute().

cnx.commit()

In connector/python, autocommit is turned off by default. If you don't commit, your MySQL server buffers up all your changes (DELETEs in your case) so it can roll them back if you choose, or if your program crashes.
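As a runnable sketch of that commit-per-batch pattern (using the stdlib sqlite3 module here purely so the example is self-contained; with mysql.connector the cursor/commit calls are analogous, and the table and column names are made up):

```python
import sqlite3

# Stand-in table: 5 rows, 3 of which have the id_ref we want purged.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE my_values (id INTEGER PRIMARY KEY, id_ref INTEGER)")
cur.executemany("INSERT INTO my_values (id_ref) VALUES (?)",
                [(r,) for r in (1, 2, 1, 3, 1)])
conn.commit()

batch = 2
while True:
    # Delete a bounded batch, then commit, so the server releases
    # undo/lock resources instead of buffering one giant transaction.
    cur.execute(
        "DELETE FROM my_values WHERE id IN "
        "(SELECT id FROM my_values WHERE id_ref = ? LIMIT ?)",
        (1, batch),
    )
    conn.commit()
    if cur.rowcount == 0:   # nothing left to delete for this id_ref
        break

remaining = cur.execute("SELECT COUNT(*) FROM my_values").fetchone()[0]
print(remaining)  # → 2 (only the id_ref 2 and 3 rows remain)
```

The `id IN (SELECT … LIMIT ?)` form is used because stock sqlite3 lacks `DELETE … LIMIT`; on MySQL you would write `DELETE … WHERE id_ref = %s LIMIT %s` directly.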

I guess your slow query is

DELETE FROM `VALUES` WHERE id_ref=constant LIMIT 1000;

Try doing this. EXPLAIN shows you the query plan.

EXPLAIN DELETE FROM `VALUES` WHERE id_ref=constant LIMIT 1000;

It should employ the index on your id_ref column. It's possible your index isn't selective enough, so your query planner chooses a table scan. In that case you might consider raising the LIMIT so your query does more work each time it runs.

You could try this. If my guess about the table scan is correct, it might help.

DELETE FROM `VALUES` FORCE INDEX (your_index_on_id_ref) WHERE id_ref=constant LIMIT 1000;

(Usually FORCE INDEX is a terrible idea. But this might be the exception.)

You could also try this: create a cleaned up temporary table, then rename tables to put it into service.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
CREATE TABLE purged_values AS 
SELECT * 
  FROM `VALUES`
 WHERE id_ref NOT IN (SELECT id_ref FROM `REFS`);

This will take a while. Run it at zero-dark-thirty. The transaction isolation level helps prevent contention with other sessions using the table while this is in progress.

Then you'll have a new, purged, table. You can index it, then do these renames to put it into service.

ALTER TABLE `VALUES` RENAME TO old_values;
ALTER TABLE purged_values RENAME TO `VALUES`;
O. Jones
  • I have run the EXPLAIN command with a real example. It is taking the right index: | 1 | DELETE | METAVALUES | NULL | range | id_ref | id_ref | 5 | const | 20647922 | 100.00 | Using where | – Javi M May 13 '21 at 13:32

Finally I did a bit more experimentation and found a way.

First step

The Python loop deleting the entries in the DB had been running for ~12 h. I added a couple of lines to measure the execution time:

      start = time.time()
      cursor.execute(sql)
      mydb.commit()
      end = time.time()

Here is a sample of the first measurements:

     1 > 900 > 0.4072246551513672
     2 > 900 > 1.7270898818969727
     3 > 900 > 1.8365845680236816
     4 > 900 > 1.124634027481079
     5 > 900 > 1.8552422523498535
     6 > 900 > 13.80513596534729
     7 > 900 > 8.379877090454102
     8 > 900 > 10.675175428390503
     9 > 900 > 6.14388370513916
     10 > 900 > 11.806004762649536
     11 > 900 > 12.884040117263794
     12 > 900 > 23.604055881500244
     13 > 900 > 19.162535905838013
     14 > 900 > 24.980825662612915
     ....

It settled at an average of ~30 s per execution after 900 iterations. (Picture attached for reference: execution time for the same query.)

In my case this would have taken ~80 days to remove all the rows with this implementation.
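For reference, a quick back-of-the-envelope check of that figure (assuming ~900 rows deleted per batch and the ~30 s steady-state batch time measured above; both numbers are taken from the log sample, not exact):

```python
total_rows = 230_000_000   # approximate table size from the question
batch_size = 900           # rows deleted per iteration
secs_per_batch = 30        # steady-state time per iteration

batches = total_rows / batch_size
days = batches * secs_per_batch / 86_400  # 86 400 seconds per day
print(round(days))  # → 89, the same order as the ~80 days quoted
```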

Final solution

I created a temporary table with the appropriate values, indexes, etc.:

CREATE TABLE ZZ_MY_VALUES AS
    SELECT * FROM MY_VALUES WHERE MY_VALUES.id_ref IN
    (
        SELECT id_ref FROM MY_REFS WHERE id_ref = 3 OR id_ref = 4 OR id_ref = 5
    )

It took ~3 h and went from 230M rows to 21M rows. A bit quicker than the original estimation of ~3 months. :)

Thanks all for your tips.

Javi M