How can I delete duplicate rows from a table without naming columns in the query?

Question

I am tracking history of changes to rows in a table that is filled with a trigger on update of another table. It tracks the revision history of the main table.

Often, my users, out of habit, will hit the SAVE button even though they have not changed anything in the record, and the system will still record a copy of that row as a revision in the history table, despite the fact that nothing has changed.

Lets say I have the tables with columns like this (although mine have about 40+ cols):

Main Data:

id, name, phone, task, dob, timestamp, note, drivername, student, doctor, userid

On Update of Main Data, insert into history:

revisionid, revisiontime, id, name, phone, task, dob, timestamp, note, drivername, student, doctor, userid

The solutions to find duplicate records presented in this site and on other sites all will work well, if I wanted to list out the columns by hand.

The problem is that there are many many columns, and that I often add columns and don't want to rewrite this query every time.

When the user saves, often only the timestamp will change. What I want to do is keep only the revisions where values have changed (ignoring the revisionid and revisiontime which always change).

In the query, I dont want to list any other column names besides the columns which i want to ignore. Is it possible?

Pseudo code:

DELETE [rows, except one] FROM historytable WHERE [all columns match values] EXCEPT [these few columns which can still be different and be deleted]

Here are a few reference questions:

Deleting duplicate rows from a table

How to check for duplicates in mysql table over multiple columns

MySQL remove duplicates from big database quick

Usually you do this before you commit to the database. Compare existing vs. requested and if there's no relevant changes, ignore it. Pruning your history table like that is not going to be easy. — tadman, Jul 10 '14 at 20:47
Yes the solution for that is here: http://stackoverflow.com/questions/6296313/mysql-trigger-after-update-only-if-row-has-changed Unfortunately, I inherited this issue.. — ethanpil, Jul 10 '14 at 20:49
Not a very orthodox solution, but you could maybe query `information_schema` for the names of columns, then dinamically build your DELETE statement using this information. — abl, Jul 10 '14 at 20:53
how many of these columns actually are expected to change? can you group by any of them and just use a select statement to find a distinct subset and then throw that into a not in clause in your delete statement? — rhealitycheck, Jul 10 '14 at 21:12
@abl now thats an idea! i will look into that now, didnt think of it. — ethanpil, Jul 10 '14 at 21:14
@rhealitycheck we need full accountability on all data points... — ethanpil, Jul 10 '14 at 21:15

score 0 · Answer 1 · answered Jul 10 '14 at 21:14

No, it isn't possible to delete duplicates from a table without specifying the columns.

The only way I know of to use a SQL statement to trim a table of dups without specifying an explicit column list is to do the following. Create a new copy with only distinct records:

create table T_UNIQUES as select distinct * from T;

You'd have to create a new table, rename the old one and then rename the new one into place. This is sometimes done on data warehouses when a DELETE operation is too slow. However, this doesn't ignore any timestamp columns, so it may not be adequate.

The only way I know to write a prune your history table with something automatic and extensible is to extract the columns from the data dictionary (INFORMATION_SCHEMA). This only automates it, but doesn't avoid specifying the columns in question.

My approach would be to fix the trigger. It sounds broken / inadequate; I would rewrite it to do an "UPSERT" instead of a blind INSERT.

I'd be grateful if you have a look at my approach and/or have any remark. Thanks — J A, Jul 10 '14 at 23:01

J A · Answer 2 · 2014-07-10T23:07:39.497

My thought process is as following..

List all the column names (with an exclusion list)

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='db' AND TABLE_NAME='table' AND COLUMN_NAME NOT IN ('columnToIgnore')
Store the names as rows in a temporary table

CREATE TEMPORARY TABLE IF NOT EXISTS columnNames AS (step1);
Fetch all records from temporary table 'columnNames' and store in a variable.

SELECT GROUP_CONCAT(COLUMN_NAME) into @cols FROM columnNames;
Prepare the final statement, list all the redundant rows. (I used SELECT for checking)

SET @sql = CONCAT('SELECT CONCAT_WS(" ",',@cols,') AS allColumns FROM targetTable GROUP BY allcolumns');

To sum up,

CREATE TEMPORARY TABLE IF NOT EXISTS columnNames AS (SELECT `COLUMN_NAME` 
FROM `INFORMATION_SCHEMA`.`COLUMNS` 
WHERE `TABLE_SCHEMA`='dbName' 
    AND `TABLE_NAME`='tableName'
    AND `COLUMN_NAME` NOT IN ('columnNameToIgnore'));

SELECT GROUP_CONCAT(COLUMN_NAME) into @cols FROM columnNames;

SET @sql = CONCAT('SELECT CONCAT_WS(" ",',@cols,')  AS allColumns FROM targetTable GROUP BY allcolumns');

PREPARE stmt FROM @sql;
EXECUTE stmt;

Who says we can't use chainsaw to slice a bread ;)

How can I delete duplicate rows from a table without naming columns in the query?

2 Answers2