SQL `group by` vs. `order by` Performance

Question

tl;dr - lots of accepted stackoverflow answers suggest using a subquery to affect the row returned by a GROUP BY clause. While this works, is it the best advice?

I understand there are many questions already about how to retrieve a specific row in a GROUP BY statement. Most of them revolve around using a subquery in the FROM clause. The subquery will order the table appropriately and the group by will be run against the now-ordered temporary table. Some examples,

PostgreSQL removes the need for the subquery with the distinct on() clause.

Postgresql DISTINCT ON with different ORDER BY

However, what I'm not understanding in any of these cases is how badly I'm shooting myself in the foot trying to do something the system may not have originally been designed for. Take the following two examples in PostgreSQL and MySQL,

In both cases I have a table of posts that contain multiple versions of the same post (signified by its UUID). I want to select the most recently published version of each post ordered by it's created_at field.

My biggest concern is that given the MySQL approach a temporary table is necessary. Ratchet this up to "web scale" (lolz) and I'm wondering if I'm in for a world of hurt. Should I rethink my schema or are there ways to optimize the subquery-parentquery relationship enough that it'll be alright?

Gordon Linoff · Accepted Answer · 2014-05-23T11:26:13.937

It is definitely not the best advice. SQL itself (and the MySQL documentation as far as I can tell) has little to say about the results from a subquery with an order by. Although they may be ordered in practice, they are not guaranteed to be.

The more important issue is the use of "hidden columns" in the aggregation. Consider this basic query:

select t.*
from (select t.* from table t order by datecol) t
group by t.col;

Everything except t.col in the select comes from an indeterminate row. The specific documentation is (emphasis is mine):

MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses.

A safe way to write such a query is:

select t.*
from table t
where not exists (select 1
                  from table t2
                  where t2.col = t.col and t2.datecol < t.datecol
                 );

This is not exactly the same, because it will return multiple values if the minimum is not unique. The logic is "get me all rows in the table where there are no rows with the same col value and a smaller datecol value.

EDIT:

The question in your comment doesn't make sense, because nothing is discussing two queries. In MySQL you can use order by with variables to solve this:

select t.*
from (select t.*,
             @rn := if(@col = col, @rn := @rn + 1, 1) as rn,
             @col := col
      from table t cross join
           (select @col := '', @rn := 0) vars
      order by col, datecol) t
where rn = 1;

This method should be faster than the order by with group by.

Here's where theory diverts from practice - because practically, every (subquery supporting) version of MySQL (to date) will return the first row of the ordered result, regardless of what's suggested by the documentation. — Strawberry, May 21 '14 at 14:16
This is great. I understand the risk of _what_ row is chosen. However, from a performance standpoint will I be able to optimize this query as much as I could if it were two queries? Maybe this is a separate question… I could accept this and ask another. — Mark Huot, May 23 '14 at 10:59

SQL `group by` vs. `order by` Performance

1 Answers1