1

I am testing against the Sakila database, see http://www.postgresqltutorial.com/postgresql-sample-database/. This database holds three relations:

  • film: film_id, title
  • actor: actor_id, first_name
  • film_actor: film_id, actor_id

I want to list all films and for each film, I want to to list all actors playing in that particular film. I ended with the following query:

select   film_id, title, array
         (
           select     first_name
           from       actor
           inner join film_actor
           on         actor.actor_id = film_actor.actor_id
           where      film_actor.film_id = film.film_id
         ) as actors
from     film
order by title;

Conceptually, this is a 1 + n query:

one query: get films
n queries: for each film f
             f.actors = array(get actors playing in f)

I always understood that 1 + n queries should be avoided at all cost, as this does not scale well.

So this made me wondering: how does PostgreSQL implement this internally? Let's say we have 1000 movies, does it internally execute 1000 select actor.first_name from actor inner join ... queries? Or is PostgreSQL smarter about this and does it something like the following?

1. one query:  get films
2. one query:  get actors related to these films while keeping reference to film_id
3. internally: for each film f
                 f.actors = array(subset of (2) according to film_id)

This does 1 + 1 queries.

Jelly Orns
  • 197
  • 1
  • 2
  • 8
  • 5
    https://www.postgresql.org/docs/current/static/sql-explain.html – zerkms May 17 '18 at 08:16
  • 2
    Also: https://www.postgresql.org/docs/current/static/using-explain.html and https://use-the-index-luke.com/sql/join –  May 17 '18 at 08:35
  • GREAT! I did not know such analysis tools exist. I am eager to learn more. – Jelly Orns May 17 '18 at 08:48
  • What should I do now with the question? I could start learning `explain` and answer my question in a few days? – Jelly Orns May 17 '18 at 08:59
  • Quick tip unrelated to your question but it helps simplify some queries - when joining 2 tables using only columns with identical names you can make use of `using`. `select * from actor inner join film_actor using(actor_id)` – Scoots May 17 '18 at 10:34
  • That's a neat trick, I did not know that either. Thank you for advising me. – Jelly Orns May 17 '18 at 17:26

2 Answers2

1

This is perhaps more appropriate for a comment, but it is too long.

Although I follow the logic of your query, I much prefer expressing it as:

select f.film_id, f.title,
       (select array_agg(a.first_name)
        from actor a inner join
             film_actor fa
             on a.actor_id = fa.actor_id
        where fa.film_id = f.film_id
       ) as actors
from film f
order by f.title;

The explicit array_agg() clarifies the logic. You are aggregating the subquery, bringing the results together as an array, and then including that as a column in the outer query.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
1

You are thinking in nested loops. This is something you should overcome when working with relational database (unless you are using MySQL).

What you describe as "1 + n" is a nested loop: you scan one table and for each row found, you scan the other table.

The way your SQL query is written, PostgreSQL has no choice but to execute a nested loop.

This is good as long as the outer table (film in your example) has few rows. Performance deteriorates rapidly once the outer table gets bigger.

In addition to nested loops, PostgreSQL has two other join strategies:

  • Hash join: The inner relation is scanned and a hash structure is created where the hash key is the join key. Then the outer relation is scanned, and the hash is probed for each row found.

    Think of it as a kind of hash join, but on the inner side you have an efficient in-memory data structure.

  • Merge join: Both tables are sorted on the join key and merged by scanning the results simultaneously.

You are advised to write your query without “correlated subqueries” so that PostgreSQL can choose the optimal join strategy:

SELECT film_id, f.title, array_agg(a.first_name)
FROM film f
   LEFT JOIN film_actor fa USING (film_id)
   LEFT JOIN actor a USING (actor_id)
GROUP BY f.title
ORDER BY f.title;

The left outer join is used so that you get a result even if a film has no actors.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • This is very interesting, thanks! Am I correct that I have to add a `GROUP BY f.film_id` in your query? Doesn't grouping introduce extra performance penalties which we don't have with the correlated subquery? – Jelly Orns May 20 '18 at 09:55
  • So in general, it is better to replace a correlated subquery by a left join? – Jelly Orns May 20 '18 at 09:56
  • I am asking myself why the query planner is unable do transform the correlated subquery to a left join by itself: https://stackoverflow.com/questions/50434001/why-is-the-query-planner-unable-to-transform-a-correlated-subquery – Jelly Orns May 20 '18 at 10:44
  • I forgot the `GROUB BY` - fixed. Of course that cocts something, but so does a nested lop. Compare them with `EXPLAIN (ANALYZE)`, then you will see what is faster. Yes, it is better to write a left join than a correlated subquery. Perhaps the optimizer is smart enough to flatten the subquery, but there is no need to make it harder for PostgreSQL than necessary. – Laurenz Albe May 20 '18 at 19:15
  • I agree with that, it's indeed better to read and we should not make it more difficult than needed. – Jelly Orns May 21 '18 at 07:56