5

In order to trim a production database for loading in a test system, we've deleted rows in many tables. This now left us with cruft in a couple of tables, namely rows which aren't used in any FK relation anymore. What I want to achieve is like the garbage collection in Java.

Or to put it another way: If I have M tables in the database. N of them (i.e. most but not all) have foreign key relations. I've deleted a couple of high level rows (i.e. which only have outgoing FK relations) via SQL. This leaves the rows in the related tables alone.

Does someone have a SQL stored procedure or a Java program which finds the N tables and then follows all the FK relations to delete rows which are no longer needed.

If finding the N tables to too complex, I could probably provide the script a list of tables to scan or, preferably, a negative list of tables to ignore.

Also note:

  1. We have some tables which are used in many (>50) FK relations, i.e. A, B, C, ... all use rows in Z.
  2. All FK relations use the technical PK column which is always a single column.
Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
  • This question shows how to collect FK related information: http://stackoverflow.com/questions/273794/mysql-how-to-determine-foreign-key-relationships-programmatically – Aaron Digulla Oct 08 '14 at 12:58
  • This question explains how to delete all rows related to a single FK relation: http://stackoverflow.com/questions/3164840/mysql-attempting-to-delete-all-rows-which-are-not-constrained-by-foreign-key – Aaron Digulla Oct 08 '14 at 13:58
  • Dynamic SQL in stored procedure: http://stackoverflow.com/questions/190776/how-to-have-dynamic-sql-in-mysql-stored-procedure – Aaron Digulla Oct 16 '14 at 09:55
  • Let me restate the spec in a shorter form to ensure I understand it: the objective is to locate every table that is referenced by any foreign key constraint, then identify in each of those tables every row which is *not* referenced from anywhere and delete it? And to do this repeatedly until there is nothing left to delete? (Or at least to do something equivalent to this process in terms of the end result produced.) – Mark Amery Oct 16 '14 at 21:58
  • @MarkAmery: Yes. A similar term would be "delete orphaned rows". Note that our DB has some tables with a lot of FK relations (i.e. `A1` to `A12` all have FK relations to `B`), so the resulting queries must join everything where `INFORMATION_SCHEMA.REFERENCED_TABLE_NAME` is the same. On a positive note, all our FK relations are to PK columns (i.e. all relations use a single column). – Aaron Digulla Oct 20 '14 at 09:14

3 Answers3

5

This issue is addressed in the MySQL Performance blog, http://www.percona.com/blog/2011/11/18/eventual-consistency-in-mysql/

He provides the following meta query, to generate queries that will identify orphaned nodes;

SELECT CONCAT(
 'SELECT ', GROUP_CONCAT(DISTINCT CONCAT(K.CONSTRAINT_NAME, '.', P.COLUMN_NAME,
  ' AS `', P.TABLE_SCHEMA, '.', P.TABLE_NAME, '.', P.COLUMN_NAME, '`') ORDER BY P.ORDINAL_POSITION), ' ',
 'FROM ', K.TABLE_SCHEMA, '.', K.TABLE_NAME, ' AS ', K.CONSTRAINT_NAME, ' ',
 'LEFT OUTER JOIN ', K.REFERENCED_TABLE_SCHEMA, '.', K.REFERENCED_TABLE_NAME, ' AS ', K.REFERENCED_TABLE_NAME, ' ',
 ' ON (', GROUP_CONCAT(CONCAT(K.CONSTRAINT_NAME, '.', K.COLUMN_NAME) ORDER BY K.ORDINAL_POSITION),
 ') = (', GROUP_CONCAT(CONCAT(K.REFERENCED_TABLE_NAME, '.', K.REFERENCED_COLUMN_NAME) ORDER BY K.ORDINAL_POSITION), ') ',
 'WHERE ', K.REFERENCED_TABLE_NAME, '.', K.REFERENCED_COLUMN_NAME, ' IS NULL;'
 ) AS _SQL
 FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE K
 INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE P
 ON (K.TABLE_SCHEMA, K.TABLE_NAME) = (P.TABLE_SCHEMA, P.TABLE_NAME)
 AND P.CONSTRAINT_NAME = 'PRIMARY'
 WHERE K.REFERENCED_TABLE_NAME IS NOT NULL
 GROUP BY K.CONSTRAINT_NAME;

I converted this to find childless parents, producing;

SELECT CONCAT(
 'SELECT ', GROUP_CONCAT(CONCAT(K.REFERENCED_TABLE_NAME, '.', K.REFERENCED_COLUMN_NAME) ORDER BY K.ORDINAL_POSITION), ' ',

 'FROM ', K.REFERENCED_TABLE_SCHEMA, '.', K.REFERENCED_TABLE_NAME, ' AS ', K.REFERENCED_TABLE_NAME, ' ',
 'LEFT OUTER JOIN ', K.TABLE_SCHEMA, '.', K.TABLE_NAME, ' AS ', K.CONSTRAINT_NAME, ' ',
 ' ON (', GROUP_CONCAT(CONCAT(K.CONSTRAINT_NAME, '.', K.COLUMN_NAME) ORDER BY K.ORDINAL_POSITION),
 ') = (', GROUP_CONCAT(CONCAT(K.REFERENCED_TABLE_NAME, '.', K.REFERENCED_COLUMN_NAME) ORDER BY K.ORDINAL_POSITION), ') ',
 'WHERE ', K.CONSTRAINT_NAME, '.', K.COLUMN_NAME, ' IS NULL;'
 ) AS _SQL
 FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE K
 INNER JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE P
 ON (K.TABLE_SCHEMA, K.TABLE_NAME) = (P.TABLE_SCHEMA, P.TABLE_NAME)
 AND P.CONSTRAINT_NAME = 'PRIMARY'
 WHERE K.REFERENCED_TABLE_NAME IS NOT NULL
 GROUP BY K.CONSTRAINT_NAME;
harvey
  • 2,945
  • 9
  • 10
  • 1
    The first query is almost what I need but it has two bugs. One I could solve - limit the query to one schema by adding `AND K.TABLE_SCHEMA = '...'` before the `GROUP BY`. The other problem is that it doesn't join enough tables when A and B have a FK relation to C. In this case, I get two queries. The query for A will report all nodes in B as orphaned and vice versa. Can you fix that (join all rows with the same `REFERENCED_TABLE_NAME` in a single query)? – Aaron Digulla Oct 20 '14 at 09:09
  • Just need some clarification, the first query finds children with no parents - Are you sure this is what you want? From your original request you were after parents that had no children - second query. Secondly, you can go multiple levels, but you need to alias the schema table for each level - how deep to you want to go? – harvey Oct 20 '14 at 23:02
  • I want orphaned rows; I'm not sure why you think I wanted childless parents. Maybe a misunderstanding? From what I know, I can't delete rows which are used in a FK relation (i.e. I can't delete a child/dependency as long as a parent row still contains the PK of the child/dep). That means I can't create childless parents without disabling the FK relations first. – Aaron Digulla Oct 21 '14 at 08:14
  • re levels: That's an interesting question which I haven't considered so far. I'm not sure it's relevant, though. As long as one child/dependency row is referenced by at least one parent anywhere, I want to keep it. So one level should be enough. Am I missing something? – Aaron Digulla Oct 21 '14 at 08:20
3

Even simple stored procedures are usually a little ugly, and this was an interesting exercise in pushing stored procedures well beyond the point where it's easy to take them.

To use the code below, launch your MySQL shell, use your target database, paste the big block of stored procedures from below, and then execute

CALL delete_orphans_from_all_tables();

to delete all orphaned rows from all tables in your database.

To provide a zoomed-out overview:

  • delete_orphans_from_all_tables is the entry point. All other sprocs are prefixed with dofat to make clear that they relate to delete_orphans_from_all_tables and make it less noisy to have them kicking around.
  • delete_orphans_from_all_tables works by calling dofat_delete_orphans_from_all_tables_iter repeatedly until there are no more rows to delete.
  • dofat_delete_orphans_from_all_tables_iter works by looping over all the tables that are targets of foreign key constraints, and for each table deleting all rows that currently aren't referenced from anywhere.

Here's the code:

delimiter //
CREATE PROCEDURE dofat_store_tables_targeted_by_foreign_keys ()
BEGIN
    -- This procedure creates a temporary table called TargetTableNames
    -- containing the names of all tables that are the target of any foreign
    -- key relation.

    SET @db_name = DATABASE();

    DROP TEMPORARY TABLE IF EXISTS TargetTableNames;
    CREATE TEMPORARY TABLE TargetTableNames (
        table_name VARCHAR(255) NOT NULL
    );

    PREPARE stmt FROM 
   'INSERT INTO TargetTableNames(table_name)
    SELECT DISTINCT referenced_table_name
    FROM INFORMATION_SCHEMA.key_column_usage
    WHERE referenced_table_schema = ?';

    EXECUTE stmt USING @db_name;
END//

CREATE PROCEDURE dofat_deletion_clause_for_table(
    IN table_name VARCHAR(255), OUT result text
)
DETERMINISTIC
BEGIN
    -- Given a table Foo, where Foo.col1 is referenced by Bar.col1, and
    -- Foo.col2 is referenced by Qwe.col3, this will return a string like:
    --
    -- NOT (Foo.col1 IN (SELECT col1 FROM BAR) <=> 1) AND
    -- NOT (Foo.col2 IN (SELECT col3 FROM Qwe) <=> 1)
    --
    -- This is used by dofat_delete_orphans_from_table to target only orphaned
    -- rows.
    --
    -- The odd-looking `NOT (x IN y <=> 1)` construct is used in favour of the
    -- more obvious (x NOT IN y) construct to handle nulls properly; note that
    -- (x NOT IN y) will evaluate to NULL if either x is NULL or if x is not in
    -- y and *any* value in y is NULL.

    SET @db_name = DATABASE();
    SET @table_name = table_name;

    PREPARE stmt FROM 
   'SELECT GROUP_CONCAT(
        CONCAT(
            \'NOT (\', @table_name, \'.\', referenced_column_name, \' IN (\',
            \'SELECT \', column_name, \' FROM \', table_name, \')\',
            \' <=> 1)\'
        )
        SEPARATOR \' AND \'
    ) INTO @result
    FROM INFORMATION_SCHEMA.key_column_usage 
    WHERE
        referenced_table_schema = ?
        AND referenced_table_name = ?';
    EXECUTE stmt USING @db_name, @table_name;

    SET result = @result;
END//

CREATE PROCEDURE dofat_delete_orphans_from_table (table_name varchar(255))
BEGIN
    -- Takes as an argument the name of a table that is the target of at least
    -- one foreign key.
    -- Deletes from that table all rows that are not currently referenced by
    -- any foreign key.

    CALL dofat_deletion_clause_for_table(table_name, @deletion_clause);
    SET @stmt = CONCAT(
       'DELETE FROM ', @table_name,
       ' WHERE ', @deletion_clause
    );

    PREPARE stmt FROM @stmt;
    EXECUTE stmt;
END//

CREATE PROCEDURE dofat_delete_orphans_from_all_tables_iter(
    OUT rows_deleted INT
)
BEGIN    
    -- dofat_store_tables_targeted_by_foreign_keys must be called before this
    -- will work.
    --
    -- Loops ONCE over all tables that are currently referenced by a foreign
    -- key. For each table, deletes all rows that are not currently referenced.
    -- Note that this is not guaranteed to leave all tables without orphans,
    -- since the deletion of rows from a table late in the sequence may leave
    -- rows from a table early in the sequence orphaned.
    DECLARE loop_done BOOL;

    -- Variable name needs to differ from the column name we use to populate it
    -- because of bug http://bugs.mysql.com/bug.php?id=28227
    DECLARE table_name_ VARCHAR(255); 

    DECLARE curs CURSOR FOR SELECT table_name FROM TargetTableNames;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET loop_done = TRUE;

    SET rows_deleted = 0;
    SET loop_done = FALSE;

    OPEN curs;
    REPEAT
        FETCH curs INTO table_name_;
        CALL dofat_delete_orphans_from_table(table_name_);
        SET rows_deleted = rows_deleted + ROW_COUNT();
    UNTIL loop_done END REPEAT;
    CLOSE curs;
END//

CREATE PROCEDURE delete_orphans_from_all_tables ()
BEGIN    
    CALL dofat_store_tables_targeted_by_foreign_keys();
    REPEAT
        CALL dofat_delete_orphans_from_all_tables_iter(@rows_deleted);
    UNTIL @rows_deleted = 0 END REPEAT;
END//
delimiter ;

As an aside, this exercise has taught me about a few things that make writing code of this level of complexity using MySQL sprocs a frustrating business. I mention all these only because they may help you, or a curious future reader, understand what look like crazy stylistic choices in the code above.

  • Grossly verbose syntax and boilerplate for simple things. e.g.
    • needing to declare and assign on different lines
    • needing to set delimiters around procedure definitions
    • needing to use a PREPARE/EXECUTE combo to use dynamic SQL).
  • Utter lack of referential transparency:
    • PREPARE stmt FROM CONCAT( ... ); is a syntax error, while @foo = CONCAT( ... ); PREPARE stmt FROM @foo; is not.
    • EXECUTE stmt USING @foo is fine, but EXECUTE stmt USING foo where foo is a procedure variable is a syntax error.
    • A SELECT statement and a procedure whose last statement is a select statement both return a result set, but pretty much everything you'd ever like to do with a result set (like looping over it or checking if something is IN it) can only be targeted at a SELECT statement, not a CALL statement.
    • You can pass a session variable as an OUT parameter to a sproc, but you can't pass a sproc variable as an OUT parameter to a sproc.
  • Totally arbitrary restrictions and bizarre behaviours that blindside you:
    • No dynamic SQL allowed in functions, only in procedures
    • Using a cursor to fetch from a column into a procedure variable of the same name always sets the variable to NULL but throws no warning or error
  • Lack of ability to cleanly pass result sets between procedures

    Result sets are a basic type in SQL; they're what SELECTs return and you think about them as objects when using SQL from the application layer. But within a MySQL sproc, you can't assign them to variables or pass them from one sproc to another. If you truly need this functionality, you have to have one sproc write a result set into a temporary table so that another sproc can read it.

  • Eccentric and unfamiliar constructs and idioms:
    • Three equivalent ways of assigning to a variable - SET foo = bar, SELECT foo = bar and SELECT bar INTO foo.
    • You'd expect that you should use procedure variables for all your state and avoid session variables for the same reasons that you avoid globals in a normal programming language. But in fact you need to use session variables everywhere because so many language constructs (like OUT params and EXECUTE) won't accept any other kind of variable.
    • The syntax for using a cursor to loop over a result set just looks alien.

Despite these obstacles, you can still piece together small programs like this with sprocs if you are determined.

Mark Amery
  • 143,130
  • 81
  • 406
  • 459
  • Thanks, this code does what I want. A question regarding the syntax, though: What does `column IN (...) <=> 1` mean? What's the difference to `column IN (...)`? – Aaron Digulla Oct 24 '14 at 09:20
  • 1
    @AaronDigulla the difference is that `'foo' NOT IN ('bar', NULL)` is `NULL`, while `NOT ('bar' IN ('foo', NULL) <=> 1)` is `TRUE`. This is because `IN` returns NULL instead of FALSE when the left operand is not in the right operand but the right operand contains a NULL, and because `NOT NULL` is `NULL`. This behaviour of `IN` is highly unintuitive (although it's part of the ANSI SQL standard, not a quirk of MySQL, and is consistent with the way the usual SQL way of handling of NULL as an unknown when evaluating boolean conditions), and I've watched a colleague lose half a day to it before. – Mark Amery Oct 24 '14 at 09:32
0

Since I had some weird SQL syntax errors, here is a solution which uses SQL from the accepted answer and Groovy. Use orphanedNodeStatistics() to get the number of nodes per table which would be deleted, dumpOrphanedNodes(String tableName) to dump the PKs of nodes which would be deleted and deleteOrphanedNodes(String tableName) to delete them.

To delete all of them, iterate over the set returned by tablesTargetedByForeignKeys()

import groovy.sql.Sql

class OrphanNodesTool {

    Sql sql;
    String schema;

    Set<String> tablesTargetedByForeignKeys() {
        def query = '''\
SELECT referenced_table_name
FROM INFORMATION_SCHEMA.key_column_usage
WHERE referenced_table_schema = ?
'''
        def result = new TreeSet()
        sql.eachRow( query, [ schema ] ) { row ->
            result << row[0]
        }
        return result
    }

    String conditionsToFindOrphans( String tableName ) {
        List<String> conditions = []

        def query = '''\
SELECT referenced_column_name, column_name, table_name
FROM INFORMATION_SCHEMA.key_column_usage
WHERE referenced_table_schema = ?
    AND referenced_table_name = ?
'''
        sql.eachRow( query, [ schema, tableName ] ) { row ->
            conditions << "NOT (${tableName}.${row.referenced_column_name} IN (SELECT ${row.column_name} FROM ${row.table_name}) <=> 1)"
        }

        return conditions.join( '\nAND ' )
    }

    List<Long> listOrphanedNodes( String tableName ) {
        def query = """\
SELECT ${tableName}.${tableName}_ID
FROM ${tableName}
WHERE ${conditionsToFindOrphans(tableName)}
""".toString()

        def result = []
        sql.eachRow( query ) { row ->
            result << row[0]
        }
        return result
    }

    void dumpOrphanedNodes( String tableName ) {
        def pks = listOrphanedNodes( tableName )
        println( String.format( "%8d %s", pks.size(), tableName ) )
        if( pks.size() < 10 ) {
            pks.each {
                println( String.format( "%16d", it as long ) )
            }
        } else {
            pks.collate( 20 ) { chunk ->
                chunk.each {
                    print( String.format( "%16d ", it as long ) )
                }
                println()
            }
        }
    }

    int countOrphanedNodes( String tableName ) {
        def query = """\
SELECT COUNT(*)
FROM ${tableName}
WHERE ${conditionsToFindOrphans(tableName)}
""".toString()

        int result;
        sql.eachRow( query ) { row ->
                result = row[0]
        }
        return result
    }

    int deleteOrphanedNodes( String tableName ) {
        def query = """\
DELETE
FROM ${tableName}
WHERE ${conditionsToFindOrphans(tableName)}
""".toString()

        int result = sql.execute( query )
        return result
    }

    void orphanedNodeStatistics() {
        def tableNames = tablesTargetedByForeignKeys()
        for( String tableName : tableNames ) {
            int n = countOrphanedNodes( tableName )
            println( String.format( "%8d %s", n, tableName ) )
        }
    }
}

(gist)

Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820