-1

The Question

I have two SQL tables albums and photos. Albums can be hierarchically nested as a tree and use the nested set paradigm. Photos are children of albums.

I need an SQL query which returns the ID of the cover photo for each album ID. The cover photo shall be the best rated photo among all recursive child photos of an album.

The solution must work with PostgreSQL, MySQL and SQLite; i.e. it should basically only use standard SQL feature at most only non-standard features which are provided by all DBMS.

I know that similar questions have already been asked (e.g. in "Select first row in each GROUP BY group?"), but I only could find answers which use SQL extensions of specific DBMS.

I already have one approach, but it is prohibitive slow.

The Environment

The albums table (simplified):

CREATE TABLE albums (
  id INTEGER PRIMARY KEY,
  parent_id INTEGER,
  _lft INTEGER NOT NULL,
  _rgt INTEGER NOT NULL,
)

The photos table (simplified)

CREATE TABLE photos (
  id INTEGER PRIMARY KEY,
  created_at DATETIME NOT NULL,
  album_id INTEGER,
  mime_type VARCHAR NOT NULL
  width INTEGER NOT NULL,
  height INTEGER NOT NULL,
  is_starred BOOLEAN NOT NULL DEFAULT false
)

The best-rated photo is defined as the a photo which is preferably starred and then the most recent one.

First Approach: Correct, but Prohibitivly Slow

SELECT
  best_child_photo.album_id AS album_id,
  best_child_photo.cover_id AS cover_id,
  photos.mime_type AS cover_mime_type,
  photos.width AS cover_width,
  photos.height AS cover_height
FROM (
  SELECT
    covered_albums.id AS album_id,
    (
      SELECT p.id
      FROM photos AS p
      LEFT JOIN albums AS direct_parents ON (direct_parents.id = p.album_id)
      WHERE direct_parents._lft >= covered_albums._lft AND direct_parents._rgt <= covered_albums._rgt
      ORDER BY p.is_starred DESC, p.created_at DESC
      LIMIT 1
    ) AS cover_id
  FROM albums AS covered_albums
) AS best_child_photo
LEFT JOIN photos ON (photos.id = best_child_photo.cover_id);

The slow part is the inner, single-value query which finds the ID of the best child photo for each album ID.

Second Approach: Fast, but Incomplete

A wrong, but much faster query is

SELECT
  albums.id AS album_id,
  photos.id AS cover_id,
  photos.mime_type AS cover_mime_type,
  photos.width AS cover_width,
  photos.height AS cover_height
FROM albums
LEFT JOIN (
  photos
  LEFT JOIN albums AS direct_parents
  ON (direct_parents.id = photos.album_id)
)
ON (direct_parents._lft >= albums._lft AND direct_parents._rgt <= albums._rgt);
ORDER BY album_id ASC, photos.is_starred DESC, photos.created_at DESC;

It is wrong, because it does not return a single row for each album but maps each album to all of its recursive child photos. Despite the fact that it returns many, many more rows it is two magnitudes faster then the first query. With the second approach the query planner can use its index tree to perform the left join.

Discussion

As a rule of thumb one can say that: "Sort first, limit to 1, join last" is slower than "Join first (with everything), sort last". As you see, the second approach misses the "limit to 1" step.

So I wonder if it is possible to use something which is based on the second approach and then filter for the first row in each partition with the same album_id. Unfortunately, something like

SELECT
  albums.id AS album_id,
  FIRST(photos.id) AS cover_id,
  FIRST(photos.type) AS cover_type
FROM ...
...
ORDER BY album_id ASC, photos.is_starred DESC, photos.created_at DESC
GROUP BY album_id

is invalid, because there is no aggregate function FIRST.

Any ideas?

Final remarks: I know that MySQL allows columns in the SELECT-clause which are neither grouped nor aggregate functions. In this case MySQL uses values of the first row which is exactly what I need, but it is MySQL-specific. With PostgreSQL I could use DISTINCT ON (album_id) which also would give my the desired result, but DISTINCT ON is only supported by PostgreSQL.

user2690527
  • 1,729
  • 1
  • 22
  • 38
  • Either your description is wrong or you are doing something more complicated than you have to (if this a task someone gave you, maybe recheck), but "I need an SQL query which returns the ID of the cover photo for each album ID" does not involve any recursivity, as the photos are directly linked to a specific album id. Albums seem to have a hierarchy, but it is not really clear how they are involved. Can you fix/clarlify either the description, maybe giving a sample data/result showing the hierarchy? – Solarflare Oct 24 '21 at 11:54
  • The design looks a little strange. Note that parent_id of each album seems to be unused. The only hierarchy of photos for one album is when a photo is related to another related album. Can you show some minimal data and the corresponding expected result? – Jon Armstrong Oct 24 '21 at 11:56
  • Also, "limit" already is not sql standard (but will work in your 3 mentioned rdbms), while ctes and window functions are sql standard, but may not work on your specific version of the database you use (e.g. MySQL 5.7). You may want to clarify any restrictions you have in that regard – Solarflare Oct 24 '21 at 12:09
  • And a third concern: portability and speed are not necessarily goals that can be reached simultaneously. Neither is the fastest code on one specific rdbm (even among versions of the same rdbm) guaranteed to be the fastest on another (because some may have some optimization that another doesn't), nor is a fully portable code necessarily fast on any of them - part of what developers do is work around/use/know about all those little sweet peculiarities of the rdbm of choice. Again, you may want to elaborate on that requirement. – Solarflare Oct 24 '21 at 12:29
  • Always consider design and logic first. Get that right, before worrying about performance. For instance, an album at the top of one hierarchy of albums with a starred photo that is the newest photo (by `created_at`) in the table, will be the cover for every album in that hierarchy. This seems odd to me. Can you explain why this is the correct design / logic behavior? Any newly starred photo in that hierarchy will become the cover for all related albums. – Jon Armstrong Oct 24 '21 at 13:12
  • @JonArmstrong wrote: "Note that parent_id of each album seems to be unused." Yes, it is unused in this specific use case, because everything can be expressed by `_lft` and `_rgt` boundaries. The `parent_id` of albums is required elsewhere (out of the scope of this query). I recommend to read [Nested Set Model](https://en.wikipedia.org/wiki/Nested_set_model). – user2690527 Oct 24 '21 at 14:34
  • @Solarflare wrote 'An SQL query which returns the ID of the cover photo for each album ID does not involve any recursivity, as the photos are directly linked to a specific album id. Albums seem to have a hierarchy, but it is not really clear how they are involved." It is all in the question. If you had also read the second sentence after the sentence you have quoted, it clearly says: _The cover photo shall be the best rated photo **among all recursive child photos** of an album._ If the best rated photo is in a leaf album than this photo is also the cover for the parent albums. – user2690527 Oct 24 '21 at 14:36
  • @JonArmstrong wrote: _"Always consider design and logic first. Get that right, before worrying about performance."_ Thank you, but the logic is correct. JonArmstrong wrote: _"For instance, an album at the top of one hierarchy of albums with a starred photo that is the newest photo, will be the cover for every album in that hierarchy."_ No it won't. Cover photos of albums are always photos within that album or within one of its children. However what may happen is that a parent album uses the same cover as one of its child albums, if the photo is the best rated. This is fine. – user2690527 Oct 24 '21 at 14:43
  • Would you provide a test case? The tests I've created show that photos from different albums can become covers for other albums, based on the album hierarchy. Photos have no other hierarchy. Maybe your use of album hierarchy is very limited, in practice. As far as standard SQL, this can be done without correlated behavior. That might help performance. Are we talking about music albums, where your concept of children are part of an album set? – Jon Armstrong Oct 24 '21 at 14:52
  • @JonArmstrong wrote: _"The tests I've created show that photos from different albums can become covers for other albums, based on the album hierarchy."_ How did you construct the left and right boundary values of each node? I guess you set them wrong. I have double checked the inequality comparisons to ensure that I haven't mixed some `<=` and `>=`, but they are all correct. JonArmstrong wrote: _"Are we talking about music albums, where your concept of children are part of an album set?"_ We are obviously speaking about photo albums. – user2690527 Oct 24 '21 at 15:04
  • I took a common example of a nested set (using your tables) and used that as the basis of a test case for this photo albums example, and copied your logic exactly to handle that nested set logic and your order logic for choosing the cover on T/F starred and created_at values. It would be better if you created the test case. I can create an answer with the details, if you wish, including the standard SQL that might be used to determine the cover for each album. – Jon Armstrong Oct 24 '21 at 15:12
  • Which version of MySQL? It changed with the advent of `ONLY_FULL_GROUP_BY`. And 8.0 has `OVER ( PARTITION BY ... )` – Rick James Oct 24 '21 at 22:00
  • See http://mysql.rjweb.org/doc.php/groupwise_max – Rick James Oct 24 '21 at 22:00

1 Answers1

0

This is just one suggestion for using standard SQL without correlated behavior to determine the cover photo per album.

WITH cte AS (
        SELECT covered_albums.*
             , p.id AS photo_id
             , p.album_id AS photo_album
             , ROW_NUMBER() OVER (PARTITION BY covered_albums.id ORDER BY is_starred DESC, created_at DESC) AS rn
          FROM photos AS p
          JOIN albums AS direct_parents ON (direct_parents.id = p.album_id)
          JOIN albums AS covered_albums
            ON direct_parents._lft >= covered_albums._lft
           AND direct_parents._rgt <= covered_albums._rgt
     )
SELECT *
  FROM cte
 WHERE rn = 1
;

Fiddle for MySQL 8.0

The result (given some test data).

Note: For this test, a simple album hierarchy (11 albums) is created with a photo for each album (having the same id, coincidentally), just to show a trivial case.

The created_at values were initialized to be different for each photo, to avoid tie issues.

+----+-----------+------+------+----------+-------------+----+
| id | parent_id | _lft | _rgt | photo_id | photo_album | rn |
+----+-----------+------+------+----------+-------------+----+
|  1 |      NULL |    1 |   22 |       10 |          10 |  1 |
|  2 |      NULL |    2 |    9 |        6 |           6 |  1 |
|  3 |      NULL |   10 |   21 |       10 |          10 |  1 |
|  4 |      NULL |    3 |    8 |        6 |           6 |  1 |
|  5 |      NULL |    4 |    5 |        5 |           5 |  1 |
|  6 |      NULL |    6 |    7 |        6 |           6 |  1 |
|  7 |      NULL |   11 |   16 |       10 |          10 |  1 |
|  8 |      NULL |   17 |   18 |        8 |           8 |  1 |
|  9 |      NULL |   19 |   20 |        9 |           9 |  1 |
| 10 |      NULL |   12 |   13 |       10 |          10 |  1 |
| 11 |      NULL |   14 |   15 |       11 |          11 |  1 |
+----+-----------+------+------+----------+-------------+----+
Jon Armstrong
  • 4,654
  • 2
  • 12
  • 14