1

I'm have a database that continuously receives data from external processes. I just discovered that occasionally some bad data is inserted. Finding the bad data is easy:

mysql> CREATE TEMPORARY TABLE singletons
    ->     AS SELECT
    ->         COUNT(*) number_rows,
    ->         name property
    ->     FROM data_property
    ->     GROUP BY property
    ->     HAVING number_rows = 1;
Query OK, 1045 rows affected (1 min 45.53 sec)
Records: 1045  Duplicates: 0  Warnings: 0

The data_property table has 44,694,240 rows. Deleting a row takes about five minutes. I can't afford to lock the table for the week or so it would take to remove all the bad data at once, so I wrote a stored procedure to delete small batches:

CREATE TEMPORARY TABLE delete_me LIKE singletons;
DELIMITER %%
CREATE PROCEDURE delete_bad_rows(v_start INT, v_step INT, v_end INT)
BEGIN
    DECLARE v_offset INT DEFAULT v_start;
    WHILE v_offset < v_end
    DO
        TRUNCATE TABLE delete_me;
        INSERT INTO delete_me
            SELECT * FROM singletons
                ORDER BY property
                LIMIT v_offset, v_step;
        DELETE LOW_PRIORITY data_property
            FROM data_property, delete_me
            WHERE name=property;
        SET v_offset = v_offset + v_step;
    END WHILE;
END;
%%
DELIMITER ;
CALL delete_bad_rows(0,3,3);

Running from the command line, that works just fine, and since there are additional SELECT statements, I can watch what's happening. So I tried this:

echo "SOURCE delete_me.sql; CALL delete_bad_rows(0,3,1045);" | \
nohup ./bin/mysql -u root -p mydata --password=xxxxxxxx >delete_me.log

When I run it this way, nothing is written to delete_me.log until I kill the process, whereupon all the output appears at once.

So, is there any way to turn off or prevent the buffering?

Or better yet, is there a way to speed up the deletion?

mysql> DESCRIBE data_property;
+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| variableid | bigint(20)    | NO   | PRI | NULL    |       |
| name       | char(8)       | NO   | PRI | NULL    |       |
| value      | varchar(1024) | NO   |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

mysql> DESCRIBE delete_me;
+-------------+------------+------+-----+---------+-------+
| Field       | Type       | Null | Key | Default | Extra |
+-------------+------------+------+-----+---------+-------+
| number_rows | bigint(21) | NO   |     | 0       |       |
| property    | char(8)    | NO   |     | NULL    |       |
+-------------+------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> EXPLAIN DELETE data_property FROM data_property, delete_me WHERE name=property;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra                          |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
|  1 | DELETE      | NULL  | NULL       | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | no matching row in const table |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
1 row in set (8 min 32.90 sec)
samwyse
  • 2,760
  • 1
  • 27
  • 38

1 Answers1

2

To answer your first question, yes, there is an option to flush stdout:

% mysql --help | grep buf
  -n, --unbuffered    Flush buffer after each query.

To answer your second question, I would avoid the temp tables and the stored procedure (I avoid stored procedures in MySQL if at all possible). Instead, generate a series of DELETE statements:

SELECT CONCAT(
  'DELETE FROM data_property WHERE name = ', QUOTE(name), ';'
) AS _sql
FROM data_property
GROUP BY name
HAVING COUNT(*) = 1;

The result will be a set of DELETE statements, one for each property that needs to be deleted. Save this result as an SQL script:

mysql -BN gen_deletes.sql > deletes.sql

Then run the DELETEs one at a time, using the option to flush after each query so you can watch progress.

mysql -v -n < deletes.sql

To make this run faster, make sure your table has an index on name so it won't incur a table-scan on each DELETE.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
  • i never noticed the --unbuffered option before. thanks! – samwyse Mar 30 '23 at 21:47
  • the table has an index on (variableid, name) so i'm looking to see if i can cheaply grab both of those to build the delete statement. since new data comes in every few minutes, i can't let afford for the db to wait very long for my selects to finish. – samwyse Mar 30 '23 at 21:47
  • The WHERE clause in your DELETE is only referencing `name`, so an index on `(variableid, name)` is not going to help. You need an index where `name` is the first column. See my answer to https://stackoverflow.com/questions/24315151/does-order-of-fields-of-multi-column-index-in-mysql-matter/24315527#24315527 – Bill Karwin Apr 04 '23 at 20:50