The Question
I have two SQL tables albums
and photos
. Albums can be hierarchically nested as a tree and use the nested set paradigm. Photos are children of albums.
I need an SQL query which returns the ID of the cover photo for each album ID. The cover photo shall be the best rated photo among all recursive child photos of an album.
The solution must work with PostgreSQL, MySQL and SQLite; i.e. it should basically only use standard SQL feature at most only non-standard features which are provided by all DBMS.
I know that similar questions have already been asked (e.g. in "Select first row in each GROUP BY group?"), but I only could find answers which use SQL extensions of specific DBMS.
I already have one approach, but it is prohibitive slow.
The Environment
The albums
table (simplified):
CREATE TABLE albums (
id INTEGER PRIMARY KEY,
parent_id INTEGER,
_lft INTEGER NOT NULL,
_rgt INTEGER NOT NULL,
)
The photos
table (simplified)
CREATE TABLE photos (
id INTEGER PRIMARY KEY,
created_at DATETIME NOT NULL,
album_id INTEGER,
mime_type VARCHAR NOT NULL
width INTEGER NOT NULL,
height INTEGER NOT NULL,
is_starred BOOLEAN NOT NULL DEFAULT false
)
The best-rated photo is defined as the a photo which is preferably starred and then the most recent one.
First Approach: Correct, but Prohibitivly Slow
SELECT
best_child_photo.album_id AS album_id,
best_child_photo.cover_id AS cover_id,
photos.mime_type AS cover_mime_type,
photos.width AS cover_width,
photos.height AS cover_height
FROM (
SELECT
covered_albums.id AS album_id,
(
SELECT p.id
FROM photos AS p
LEFT JOIN albums AS direct_parents ON (direct_parents.id = p.album_id)
WHERE direct_parents._lft >= covered_albums._lft AND direct_parents._rgt <= covered_albums._rgt
ORDER BY p.is_starred DESC, p.created_at DESC
LIMIT 1
) AS cover_id
FROM albums AS covered_albums
) AS best_child_photo
LEFT JOIN photos ON (photos.id = best_child_photo.cover_id);
The slow part is the inner, single-value query which finds the ID of the best child photo for each album ID.
Second Approach: Fast, but Incomplete
A wrong, but much faster query is
SELECT
albums.id AS album_id,
photos.id AS cover_id,
photos.mime_type AS cover_mime_type,
photos.width AS cover_width,
photos.height AS cover_height
FROM albums
LEFT JOIN (
photos
LEFT JOIN albums AS direct_parents
ON (direct_parents.id = photos.album_id)
)
ON (direct_parents._lft >= albums._lft AND direct_parents._rgt <= albums._rgt);
ORDER BY album_id ASC, photos.is_starred DESC, photos.created_at DESC;
It is wrong, because it does not return a single row for each album but maps each album to all of its recursive child photos. Despite the fact that it returns many, many more rows it is two magnitudes faster then the first query. With the second approach the query planner can use its index tree to perform the left join.
Discussion
As a rule of thumb one can say that: "Sort first, limit to 1, join last" is slower than "Join first (with everything), sort last". As you see, the second approach misses the "limit to 1" step.
So I wonder if it is possible to use something which is based on the second approach and then filter for the first row in each partition with the same album_id
. Unfortunately, something like
SELECT
albums.id AS album_id,
FIRST(photos.id) AS cover_id,
FIRST(photos.type) AS cover_type
FROM ...
...
ORDER BY album_id ASC, photos.is_starred DESC, photos.created_at DESC
GROUP BY album_id
is invalid, because there is no aggregate function FIRST
.
Any ideas?
Final remarks: I know that MySQL allows columns in the SELECT
-clause which are neither grouped nor aggregate functions. In this case MySQL uses values of the first row which is exactly what I need, but it is MySQL-specific. With PostgreSQL I could use DISTINCT ON (album_id)
which also would give my the desired result, but DISTINCT ON
is only supported by PostgreSQL.