0

Now, as I understand when you use aggregate functions such as AVG, SUM etc you have to keep in mind that any other fields you SELECT that aren't also involved in an aggregate function will be indeterminate, for example:

SELECT AVG(amount), name, desc FROM some_table;

I understand this and this is because the value coming from the aggregate function isn't tied to any one row and hence the other fields selected are indeterminate.

However, if you use a different type of aggregate function such as MIN or MAX where what they retrieve is tied to a certain row then is it safe to assume that any other fields selected that aren't within an aggregate function can be determined? ... as the result would be tied to a specific row of data unlike the other aggregate function results?

For example:

SELECT MIN(media_id),
       auction_id,
       media_url
FROM   auction_media
WHERE  auction_id IN( 119925, 124660, 124663, 129078,
                      129094, 134395, 149753, 152221,
                      154733, 154737, 154742, 157694,
                      161411, 165965, 165973 )
       AND media_type = 1
       AND upload_in_progress = 0
GROUP  BY auction_id;

If I am right in my thinking this would always return the correct media_url right?

Brett
  • 19,449
  • 54
  • 157
  • 290

3 Answers3

2

However, if you use a different type of aggregate function such as MIN or MAX where what they retrieve is tied to a certain row then is it safe to assume that any other fields selected that aren't within an aggregate function can be determined?

Nope. For one, multiple rows can have the min or max value; for another, there is nothing stopping a query from selecting MIN(a), MAX(a), AVG(a), and SUM(a) all at once (and I highly doubt MySQL would over-complicate it's query engine to take advantage of "if the query has only one aggregate...")


Note: I am fairly certain the only reason MySQL originally even allowed such queries was for short hand in situations like:

SELECT a.*, SUM(b.X)
FROM a INNER JOIN b ON a.PK = b.a_PK
GROUP BY a.PK;

where the query author knows the non-aggregated fields are can be determined by virtue of the grouping, not the aggregated value(s).

Uueerdo
  • 15,723
  • 1
  • 16
  • 21
  • Hmmm you make a good point. However for your first point, in my case, for this situation, the `media_id` is unique and hence couldn't duplicate. – Brett Apr 09 '18 at 20:22
  • Yeah, I feel the bulk of my answer lies in the second point and it's parenthesized note about unlikely optimizations... especially since MySQL is the only rdbms that even allows such aggregate queries, and even it's most recent releases default to a configuration that disallows them. – Uueerdo Apr 09 '18 at 20:26
  • Also, keep in mind that media_id's uniqueness has even less relevance since you are _grouping on_ auction_id; it can be highly data dependent whether a field such as media_id would be unique on other groupings. For MySQL to guess you wanted such behavior it would have to do some rather comprehensive data analysis. – Uueerdo Apr 09 '18 at 20:29
  • 1
    Also even if media_id is unique in one table, the query could be doing a join to another table, so media_id *wouldn't* be unique within the group of rows generated in the join. – Bill Karwin Apr 09 '18 at 20:31
  • Thanks - some food for thought there. :) – Brett Apr 09 '18 at 20:34
2

MIN and MAX is no more tied to any row than AVG or SUM is. All 4 of them are the result of aggregating multiple rows, whether all rows (like you first query), or the rows in a group (like your second query).

If I am right in my thinking this would always return the correct media_url right?

No. What if your data is:

auction_id   media_id   media_url
119925       3          http://google.com
119925       5          http://yahoo.com
119925       3          http://bing.com

Your query SELECT MIN(media_id), auction_id, media_url GROUP BY auction_id would return 3 for MIN(media_id), and 119925 for auction_id, but what media_url would it return?

media_url is still indeterminate.

You see, there is nothing in the data that says that media_url is in any way related to media_id.

You might (think you) know that the denormalized media_url is always the same for a particular media_id, but that doesn't matter to the SQL engine.

Andreas
  • 154,647
  • 11
  • 152
  • 247
  • I probably should of pointed out that `media_id` is unique. – Brett Apr 09 '18 at 20:24
  • @Brett Doesn't matter to the SQL engine. `media_url` is still indeterminate as far as the SQL statement is concerned. Unique indexes will never affect the result of a query, though indexes may of course affect the performance of getting that result. Unique indexes affect the data that can be *inserted* into the table, but not queries *from* the table. – Andreas Apr 09 '18 at 20:28
2

No. The unaggregated columns (that are not in the group by) in an aggregation query come from arbitrary and indeterminate rows. This awkward behavior is why the syntax is not allowed in most databases and why the most recent versions of MySQL "turn-it-off" by default. So your query would return an error.

Here is one way to do what you want:

SELECT am.*
FROM auction_media am
WHERE auction_id IN (119925, 124660, 124663, 129078,
                      129094, 134395, 149753, 152221,
                      154733, 154737, 154742, 157694,
                      161411, 165965, 165973 ) AND
      media_type = 1 AND upload_in_progress = 0 AND
      media_id = (SELECT MIN(am2.media_id)
                  FROM auction_media m2
                  WHERE m2.auction_id = m.auction_id AND m2.media_type = m.media_type AND m2.upload_in_progress = m.upload_in_progress
                 );

For performance you want an index on auction_media(auction_id, media_type, upload_in_progress, media_id) and auction_media(media_type, upload_in_progress, auction_id).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks for that. I had a similar looking query before in the main query, but I was trying to separate the portion about getting the media_url into it's own query to avoid an expensive join and subquery as it was really slow. – Brett Apr 09 '18 at 20:28
  • I think the better method to do it now would be to remove all the aggregate functions and get all the media_urls and then just process them via PHP to get the data with the lowest media_id. – Brett Apr 09 '18 at 20:36