13

Problem

Suppose I have this table tab (fiddle available).

| g | a | b |     v |
---------------------
| 1 | 3 | 5 |   foo |
| 1 | 4 | 7 |   bar |
| 1 | 2 | 9 |   baz |
| 2 | 1 | 1 |   dog |
| 2 | 5 | 2 |   cat |
| 2 | 5 | 3 | horse |
| 2 | 3 | 8 |   pig |

I'm grouping rows by g, and for each group I want one value from column v. However, I don't want any value, but I want the value from the row with maximal a, and from all of those, the one with maximal b. In other words, my result should be

| 1 |   bar |
| 2 | horse |

Current solution

I know of a query to achieve this:

SELECT grps.g,
(SELECT v FROM tab
 WHERE g = grps.g
 ORDER BY a DESC, b DESC
 LIMIT 1) AS r
FROM (SELECT DISTINCT g FROM tab) grps

Question

But I consider this query rather ugly. Mostly because it uses a dependant subquery, which feels like a real performance killer. So I wonder whether there is an easier solution to this problem.

Expected answers

The most likely answer I expect to this question would be some kind of add-on or patch for MySQL (or MariaDB) which does provide a feature for this. But I'll welcome other useful inspirations as well. Anything which works without a dependent subquery would qualify as an answer.

If your solution only works for a single ordering column, i.e. couldn't distinguish between cat and horse, feel free to suggest that answer as well as I expect it to be still useful to the majority of use cases. For example, 100*a+b would be a likely way to order the above data by both columns while still using only a single expression.

I have a few pretty hackish solutions in mind, and might add them after a while, but I'll first look and see whether some nice new ones pour in first.


Benchmark results

As it is pretty hard to compare the various answers just by looking at them, I've run some benchmarks on them. This was run on my own desktop, using MySQL 5.1. The numbers won't compare to any other system, only to one another. You probably should be doing your own tests with your real-life data if performance is crucial to your application. When new answers come in, I might add them to my script, and re-run all the tests.

So it seems that my own solution so far isn't all that bad, even with the dependent subquery. Surprisingly, the solution by acatt, which uses a dependent subquery as well and which I therefore would have considered about the same, performs much worse. Probably something the MySQL optimizer can't cope with. The solution RichardTheKiwi proposed seems to have good overall performance as well. The other two solutions heavily depend on the structure of the data. With many groups small groups, xdazz' approach outperforms all other, whereas the solution by Dems performs best (though still not exceptionally good) for few large groups.

Community
  • 1
  • 1
MvG
  • 57,380
  • 22
  • 148
  • 276
  • What index(es) did you apply to the table? Also, note that RichardTheKiwi's approach appears to be quite stable. I'd also estimate that it is linear as you scale the number of items in total. – MatBailie Oct 04 '12 at 18:22
  • @Dems, auto increment id as primary key which is not involved in the query, and `(g,a,b)` as a composite unique key. As sorting takes O(n log n) and I doubt that MySQL is clever enough to optimize the sorting to a selection, I estimate his solution to be slightly slower than linear, but the difference seems small. [Script available at dpaste](http://dpaste.com/hold/809915/) for a while. – MvG Oct 04 '12 at 19:02
  • cc @Dems // First rule of benchmarks is to state the margin of error. My solution should produce almost the same result for all 3, and the range is about right. Surprising the 2nd would blow out by that much. I can't be fussed right now, but would be interesting to see the EXPLAIN against all queries. Your query would be (n log n) and heavily dependent on group density leading using DISTINCT. – RichardTheKiwi Oct 04 '12 at 19:58

4 Answers4

5

This way doesn't use sub-query.

SELECT t1.g, t1.v
FROM tab t1
LEFT JOIN tab t2 ON t1.g = t2.g AND (t1.a < t2.a OR (t1.a = t2.a AND t1.b < t2.b))
WHERE t2.g IS NULL

Explanation:

The LEFT JOIN works on the basis that when t1.a is at its maximum value, there is no s2.a with a greater value and the s2 rows values will be NULL.

xdazz
  • 158,678
  • 38
  • 247
  • 274
  • [Incorrect result](http://www.sqlfiddle.com/#!2/ad3987/16): you don't even select the `v` value. Your query basically does `SELECT g, MAX(a) FROM tab GROUP BY g` only. – MvG Oct 04 '12 at 11:57
  • @MvG My typo, change `t1.a` to `t1.v` will give you the right result. – xdazz Oct 04 '12 at 12:01
  • Still won't break the tie between cat and horse yet, but that can be fixed. Feel free to include the [fixed version](http://www.sqlfiddle.com/#!2/ad3987/33) into your answer. – MvG Oct 04 '12 at 12:04
  • @MvG Yes, your fixed version is right, I didn't read all of your question. – xdazz Oct 04 '12 at 12:06
  • `This way doesn't use sub-query.` and that's good? This is O(n^2) and not something I'd run in production – RichardTheKiwi Oct 04 '12 at 13:20
  • 1
    @RichardTheKiwi: For small group sizes, this query outperforms yours, according to my benchmarks which I've edited into the question. So asymptotic complexity isn't everything here, it very much depends on the actual data. – MvG Oct 04 '12 at 15:49
5
SELECT g, a, b, v
  FROM (
            SELECT *, 
                   @rn := IF(g = @g, @rn + 1, 1) rn, 
                   @g := g
              FROM (select @g := null, @rn := 0) x, 
                   tab
          ORDER BY g, a desc, b desc, v
       ) X
 WHERE rn = 1;

Single pass. All the other solutions look O(n^2) to me.

David Harkness
  • 35,992
  • 10
  • 112
  • 134
RichardTheKiwi
  • 105,798
  • 26
  • 196
  • 262
  • This looks pretty good in terms of performance, as you employ an idea similar to what Dems writes about `ROW_NUMBER` in other RDBMS. What has me worried is that according to the [MySQL docs](http://dev.mysql.com/doc/refman/5.6/en/user-variables.html), *“The order of evaluation for expressions involving user variables is undefined […]”* (please also read surrounding paragraph). As I see it, there is no guarantee that the rows will get numbered in the order specified by the `ORDER BY`. Is there? – MvG Oct 04 '12 at 13:26
  • Posted [another question](http://stackoverflow.com/q/12728949) asking about guarantees for this kind of query. – MvG Oct 04 '12 at 14:01
  • I've accepted this answer. On the one hand it exhibits good behaviour over a wide range of my benchmarks. On the other hand it is code I wouldn't have used before asking this question, but I'd consider to use now, so I learned quite a lot from it. Even though I keep wondering about guarantees. – MvG Oct 08 '12 at 06:13
1

Many RDBMS have constructs that are particularly suited to this problem. MySQL isn't one of them.

This leads you to three basic approaches.

  • Check each record to see if it is one you want, using EXISTS and a correlated sub-query in an EXISTS clause. (@acatt's answer, but I understand that MySQL doesn't always optimise this very well. Ensure that you have a composite index on (g,a,b) before assuming that MySQL won't do this very well.)

  • Do a half cartesian product to full-fill the same check. Any record which does not join is a target record. Where each group ('g') is large, this can quickly degrade performance (If there are 10 records for each unique value of g, this will yield ~50 records and discard 49. For a group size of 100 it yields ~5000 records and discard 4999), but it is great for small group sizes. (@xdazz's answer.)

  • Or use multiple sub-queries to determine the MAX(a) and then the MAX(b)...

Multiple sequential sub-queries...

SELECT
  yourTable.*
FROM
  (SELECT g,    MAX(a) AS a FROM yourTable GROUP BY g   ) AS searchA
INNER JOIN
  (SELECT g, a, MAX(b) AS b FROM yourTable GROUP BY g, a) AS searchB
    ON  searchA.g = searchB.g
    AND searchA.a = searchB.a
INNER JOIN
  yourTable
    ON  yourTable.g = searchB.g
    AND yourTable.a = searchB.a
    AND yourTable.b = searchB.b

Depending on how MySQL optimises the second sub-query, this may or may not be more performant than the other options. It is, however, the longest (and potentially least maintainable) code for the given task.

Assuming an composite index on all three search fields (g, a, b), I would presume it to be best for large group sizes of g. But that should be tested.

For small group sizes of g, I'd go with @xdazz's answer.

EDIT

There is also a brute force approach.

  • Create an identical table, but with an AUTO_INCREMENT column as an id.
  • Insert your table into this clone, ordered by g, a, b.
  • The id's can then be found with SELECT g, MAX(id).
  • This result can then be used to look-up the v values you need.

This is unlikely to be the best approach. If it is, it is effectively a condmenation of MySQL's optimiser's ability to deal with this type of problem.

That said, every engine has it's weak spots. So, personally, I try everything until I think I understand how the RDBMS is behaving and can make my choice :)

EDIT

Example using ROW_NUMBER(). (Oracle, SQL Server, PostGreSQL, etc)

SELECT
  *
FROM
(
  SELECT
    ROW_NUMBER() OVER (PARTITION BY g ORDER BY a DESC, b DESC) AS sequence_id,
    *
  FROM
    yourTable
)
  AS data
WHERE
  sequence_id = 1
MatBailie
  • 83,401
  • 18
  • 103
  • 137
  • It's not really part of my original question, but I'd welcome a few examples of what constructs other RDBMS provide. Perhaps a parenthesis naming a few, with links where available. – MvG Oct 04 '12 at 12:42
  • @MvG - `ROW_NUMBER()` answer added. This is an analytic function, or window function, that some RDBMS have implemented. It is part of the ever evolving SQL standard. The standard defines language constructs, not how to implement them - this means that the different RDBMS have different sub-sets of that standard ;) – MatBailie Oct 04 '12 at 12:53
0

This can be solved using a correlated query:

SELECT g, v
FROM tab t
WHERE NOT EXISTS (
    SELECT 1
    FROM tab
    WHERE g = t.g
        AND a > t.a
        OR (a = t.a AND b > t.b)
    )
acatt
  • 487
  • 3
  • 10
  • According to the [execution plan](http://www.sqlfiddle.com/#!2/ad3987/35) this `NOT EXISTS` thing is still a DEPENDENT SUBQUERY: it will be executed repeatedly, once for each row in the table. – MvG Oct 04 '12 at 12:09
  • @MvG I was under the impression that there were performance issues related to dependent sub-queries but this was resolved at some point. Apologies if this isn't the case. Moreover, xdazz's solution appears to be the best here. – acatt Oct 04 '12 at 12:35
  • @acatt - I would suggest that there may be a tipping point. The half-cartesian-product version may scale much worse than this answer. As the size of the groups grows this may become relatively more effective, and may at some point become better. – MatBailie Oct 04 '12 at 12:41
  • I was under the same impression, which is why I wanted to avoid them. If issues were resolved, I wonder whether the version in my question benefits from that optimization as well. @Dems, thanks for pointing this out. Will make accepting a *single* answer a lot harder, though. – MvG Oct 04 '12 at 12:46
  • @MvG : Accept either the most informative answer to you, or the answer that contains the approach you actually use. And upvote all the others. *(Then post a cheque for £10 to me at my home address of **BEEEEEEEEEEEEEEEP**)* – MatBailie Oct 04 '12 at 12:50