1

I'm searching for a SQL command that does the following:

From every set of rows with the same value in column A, delete all rows except the one with the highest value in column B. If several rows in a set share that highest B value, keep at least one of them.

Additional notes

  • The column format should not be modified, nor should additional tables be required to achieve the desired result.
  • The table has only two columns, neither of which is a primary, unique, or multiple-occurrence key.
  • The query should work well with bigger datasets, i.e. the running time should be proportional to the number of rows in the table (not quadratic/exponential).

Example:

Initial state:

+---+---+
| A | B |
+---+---+
| x | 1 |
| x | 2 |
| y | 3 |
+---+---+

Desired result:

+---+---+
| A | B |
+---+---+
| x | 2 |
| y | 3 |
+---+---+
timakro

4 Answers

1

Oy, this is a pain. I think the best way is truncate/insert:

-- copy the table, empty it, then re-insert only the max-b rows per a
create table temp_t as
    select *
    from t;

truncate table t;

insert into t(a, b)
    select distinct a, b
    from temp_t tt
    where tt.b = (select max(tt2.b) from temp_t tt2 where tt2.a = tt.a);
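
Since the question asks about running time on bigger datasets, the correlated subquery can also be written as a join against a grouped derived table, which avoids re-running the max() lookup per row. A sketch, using the same table names as above:

insert into t(a, b)
    select tt.a, tt.b
    from temp_t tt
    join (
        -- one row per a, carrying its maximum b
        select a, max(b) as b
        from temp_t
        group by a
    ) m on m.a = tt.a and m.b = tt.b
    group by tt.a, tt.b;  -- collapses duplicate max rows, like the distinct above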

Another alternative would be to add a third column and assign that a unique number, which can then be used for the delete.
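
A sketch of that alternative, assuming the table is named t with columns a and b (the id column and its name are illustrative; note that it temporarily modifies the column format, which the question forbids):

alter table t add column id int not null auto_increment primary key;

-- delete every row that is beaten by another row with the same a:
-- either a higher b, or an equal b with a higher id (so one duplicate survives)
delete t1
from t t1
join t t2
  on t1.a = t2.a
 and (t1.b < t2.b or (t1.b = t2.b and t1.id < t2.id));

alter table t drop column id;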

Gordon Linoff
  • @timakro . . . This answer actually has the functionality that you describe. The answer you accepted does not. – Gordon Linoff Oct 28 '16 at 01:50
  • @timakro This is the only solution that gets rid of the duplicates in the result, unless you're willing to change the table schema. – Barmar Oct 28 '16 at 03:05
0

This query should work!!

This keeps the MAX(B) row for each A and deletes all the remaining rows from the table, which are not required.

DELETE s1
FROM `table` s1,
    ( SELECT A, MAX(B) AS B FROM `table` GROUP BY A ) s2
WHERE s1.A = s2.A
    AND s1.B <> s2.B;
Teja
  • No. One `A` may have a max `B` that another `A` has for a non-max `B` value. This would mean that some non-max `B` rows could be left undeleted – Bohemian Oct 27 '16 at 18:54
  • I included A as well in the IN clause so that it will remove all rows which do not meet MAXB for each A. – Teja Oct 27 '16 at 18:59
  • Doesn't this get the error "You can't specify target table 'Table' for update in FROM clause"? – Barmar Oct 27 '16 at 19:18
  • @Barmar You are correct, I first thought this worked but even after fixing two minor syntactic errors (one of which you discovered too) it doesn't give the correct results. – timakro Oct 27 '16 at 19:21
  • You guys are right. I have modified the logic and added a new one. Please check it and let me know if it's working or not. – Teja Oct 27 '16 at 19:23
  • I think this would be the syntactically correct version (which still doesn't work): `DELETE FROM Table WHERE (A,B) NOT IN (SELECT A,MAX(B) FROM (SELECT A,B FROM Table) x group by A);` – timakro Oct 27 '16 at 19:23
  • @timakro The rule is that you can't use the same table you're deleting from in a subquery in `WHERE`. It has to be done with a join, not `WHERE IN` or `WHERE NOT IN`. – Barmar Oct 27 '16 at 19:25
  • Even if this worked the runtime wouldn't be proportional to the number of rows in the table. – timakro Oct 27 '16 at 19:27
0

Use MySQL's multi-table delete:

-- delete every row for which another row with the same A and a higher B exists
delete t2
from mytable t1, mytable t2
where t2.A = t1.A
and t2.B < t1.B;

An index on column A will make this perform well.
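
For example (the index names are illustrative):

create index idx_mytable_a on mytable (A);
-- or a composite index, so both join conditions can be resolved from the index:
create index idx_mytable_a_b on mytable (A, B);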

Bohemian
  • I forgot about multiple occurrences keys, I don't have them either, updated the question. Sorry! – timakro Oct 27 '16 at 19:06
-1

Use a JOIN with a subquery that gets the maximum B for each A, and make the JOIN condition match only the rows whose B is different from that maximum.

DELETE t1
FROM `Table` AS t1
JOIN (
    SELECT A, MAX(B) AS maxB
    FROM `Table`
    GROUP BY A) AS t2
ON t1.A = t2.A AND t1.B != t2.maxB;

To get rid of the duplicates that remain, use one of the solutions in How to delete duplicates on a MySQL table? (http://stackoverflow.com/questions/2630440/how-to-delete-duplicates-on-a-mysql-table), for example the sketch below.
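
One sketch from that family of solutions, rebuilding the table from its distinct rows (the temp-table name is illustrative, and `Table` stands for the real table name as above):

create table tmp_dedup as
    select distinct A, B
    from `Table`;

truncate table `Table`;

insert into `Table` (A, B)
    select A, B
    from tmp_dedup;

drop table tmp_dedup;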

Barmar
  • I updated the sample data to fit my description, the result of your query differs in that there are two rows with `A="x"` and `B=2` left instead of one. – timakro Oct 27 '16 at 19:35
  • You can use a second query to remove all the duplicate rows after this is done. See http://stackoverflow.com/questions/2630440/how-to-delete-duplicates-on-a-mysql-table – Barmar Oct 27 '16 at 19:45
  • The only way to identify rows in MySQL is by their column contents. If you have identical rows, there's no way to delete some of them without deleting all of them. You need an additional column, such as an auto-incremented ID, to distinguish them. – Barmar Oct 27 '16 at 19:48
  • Other databases have a built-in `ROWID` identifier that can be used for this, but not MySQL. – Barmar Oct 27 '16 at 19:50
  • Couldn't that additional column be added in a subquery? – timakro Oct 27 '16 at 20:37
  • No, because that still won't let you tell the DB which of them to delete in the actual table. – Barmar Oct 27 '16 at 20:49
  • Okay, I'll do it with a second query. – timakro Oct 27 '16 at 21:01
  • How does this answer solve your problem? This query will still return the duplicates in the final output... 'X'|2 will be coming up twice in the end result... – Teja Oct 28 '16 at 03:01
  • @Teja He conceded that this gets rid of the lower values, he'll use one of the de-duplicating solutions at the linked question to get rid of the duplicates that remain. – Barmar Oct 28 '16 at 03:03
  • @Teja If he doesn't modify the schema, there's no practical way to get rid of the duplicates in the final result. – Barmar Oct 28 '16 at 03:04
  • I tested the command on the real dataset today, with about 150,000 entries this takes forever. I now understand that the join compares every row from t1 to every row from t2, which makes the running time not proportional to the number of rows in the table; that's why I un-accepted the answer again. – timakro Oct 28 '16 at 11:40
  • @Teja I figured that deleting duplicates is not worth the disadvantages of those solutions and updated my question. Now at least one of the duplicate rows should not be removed. – timakro Oct 28 '16 at 11:48
  • @timakro For any large join to be efficient you need an index on the column(s) being joined. A composite index on `(a, b)` should speed this up. – Barmar Oct 28 '16 at 14:27