
In "How does PostgreSQL approach a 1 + n query?", I learned that a correlated subquery can be rewritten as a left join:

select   film_id, title,
         (
           select     array_agg(first_name)
           from       actor
           inner join film_actor using(actor_id)
           where      film_actor.film_id = film.film_id
         ) as actors
from     film
order by title;

to

select   f.film_id, f.title, array_agg(a.first_name)
from     film f
   left join film_actor fa using(film_id)
   left join actor      a  using(actor_id)
group by f.film_id
order by f.title;

Both queries return the same results, but the second query performs better.

This makes me wonder: why is the query planner unable to do such transformations by itself?

I can see why not all correlated subqueries can be transformed to a join, but I don't see any issue with this particular query.
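
For example (a hypothetical case, not part of the queries above), a correlated scalar subquery that might return more than one row has no direct join equivalent: it must raise a runtime error, whereas a plain join rewrite would silently emit extra rows.

-- raises "more than one row returned by a subquery used as an expression"
-- for any film that has two or more actors; a join rewrite would instead
-- produce one output row per actor
select f.title,
       (select fa.actor_id
        from   film_actor fa
        where  fa.film_id = f.film_id) as some_actor
from   film f;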

Update: performance

I tried to compare the performance as follows: I ran two consecutive loops of 100 executions of the first query, followed by two consecutive loops of 100 executions of the second query. In both cases I ignored the first loop, treating it as a warm-up.

I get 16 seconds for 100x the first query and 11 seconds for 100x the second query.
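
(For reference, a minimal sketch of such a loop driven from psql; the original harness isn't shown, so the details below are an assumption:)

\timing on

-- execute the join version 100 times, discarding the rows; run the block
-- twice and keep only the second timing (the first pass is the warm-up)
do $$
declare
  r record;
begin
  for i in 1..100 loop
    for r in
      select f.film_id, f.title, array_agg(a.first_name)
      from   film f
         left join film_actor fa using(film_id)
         left join actor      a  using(actor_id)
      group by f.film_id
      order by f.title
    loop
      null;  -- fetch and ignore
    end loop;
  end loop;
end $$;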

The explain analyze outputs are as follows:

correlated subquery:

 Index Scan using idx_title on film  (cost=0.28..24949.50 rows=1000 width=51) (actual time=0.690..74.828 rows=1000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=24.84..24.85 rows=1 width=32) (actual time=0.068..0.068 rows=1 loops=1000)
       ->  Hash Join  (cost=10.82..24.82 rows=5 width=6) (actual time=0.034..0.055 rows=5 loops=1000)
         Hash Cond: (film_actor.actor_id = actor.actor_id)
         ->  Bitmap Heap Scan on film_actor  (cost=4.32..18.26 rows=5 width=2) (actual time=0.025..0.040 rows=5 loops=1000)
               Recheck Cond: (film_id = film.film_id)
               Heap Blocks: exact=5075
               ->  Bitmap Index Scan on idx_fk_film_id  (cost=0.00..4.32 rows=5 width=0) (actual time=0.015..0.015 rows=5 loops=1000)
                 Index Cond: (film_id = film.film_id)
         ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.338..0.338 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 17kB
               ->  Seq Scan on actor  (cost=0.00..4.00 rows=200 width=10) (actual time=0.021..0.133 rows=200 loops=1)
 Planning time: 1.277 ms
 Execution time: 75.525 ms

join:

 Sort  (cost=748.60..751.10 rows=1000 width=51) (actual time=35.865..36.060 rows=1000 loops=1)
   Sort Key: f.title
   Sort Method: quicksort  Memory: 199kB
   ->  GroupAggregate  (cost=645.31..698.78 rows=1000 width=51) (actual time=23.953..34.204 rows=1000 loops=1)
     Group Key: f.film_id
     ->  Sort  (cost=645.31..658.97 rows=5462 width=25) (actual time=23.910..25.210 rows=5465 loops=1)
           Sort Key: f.film_id
           Sort Method: quicksort  Memory: 619kB
           ->  Hash Left Join  (cost=84.00..306.25 rows=5462 width=25) (actual time=2.098..16.237 rows=5465 loops=1)
             Hash Cond: (fa.actor_id = a.actor_id)
             ->  Hash Right Join  (cost=77.50..231.03 rows=5462 width=21) (actual time=1.786..10.636 rows=5465 loops=1)
               Hash Cond: (fa.film_id = f.film_id)
               ->  Seq Scan on film_actor fa  (cost=0.00..84.62 rows=5462 width=4) (actual time=0.018..2.221 rows=5462 loops=1)
               ->  Hash  (cost=65.00..65.00 rows=1000 width=19) (actual time=1.753..1.753 rows=1000 loops=1)
                 Buckets: 1024  Batches: 1  Memory Usage: 59kB
                 ->  Seq Scan on film f  (cost=0.00..65.00 rows=1000 width=19) (actual time=0.029..0.819 rows=1000 loops=1)
             ->  Hash  (cost=4.00..4.00 rows=200 width=10) (actual time=0.286..0.286 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 17kB
               ->  Seq Scan on actor a  (cost=0.00..4.00 rows=200 width=10) (actual time=0.016..0.114 rows=200 loops=1)
 Planning time: 1.648 ms
 Execution time: 36.599 ms
Jelly Orns
  • https://stackoverflow.com/a/50426511/905902 In your case the title is probably functionally dependent on `film_id`, but the optimiser does not understand this. – wildplasser May 20 '18 at 10:53
  • What answer do you expect to the first query when there is a film with no actors? What about the second query? – WW. May 20 '18 at 10:56
  • WW: `null` as actors is fine in both cases. I could add a `coalesce(...)` if needed (see the sketch after these comments). – Jelly Orns May 20 '18 at 10:59
  • `explain ANALYZE`, please (and: compare the expected vs observed, the expected=1000 is suspect ...) – wildplasser May 20 '18 at 11:11
  • I added the `explain analyze`. – Jelly Orns May 20 '18 at 11:14
  • `left join actor` != `inner join actor` [But I don't think the optimizer is able to unpack scalar subqueries] – wildplasser May 20 '18 at 11:25
  • I am not really sure what you meant by saying `left` != `inner`. I understand the difference between `left join` and `inner join`, but I don't see the relevance here. In the first query, I have `inner join` in the subquery, as I only want actors related to that film. If no such actors exist, we end up with `null`. In the second query, I have `left join`, to make sure that we keep returning all films. Also here, if no such actors exist, we end up with `null` again. – Jelly Orns May 20 '18 at 11:30
  • BTW: the resultsets for these queries are too small. Once you outgrow memory, the second query would lose. [but the optimiser could choose a different plan, avoiding the double sort and the hash tables] – wildplasser May 20 '18 at 14:36
  • @wildplasser So what are you proposing then? I should revert to the correlated subquery? – Jelly Orns May 20 '18 at 18:22
  • With execution times of 40ms and 80ms, there is little to compare, even for the optimiser (this is the region of the problem-space where the cost calculation still makes little sense). Don't try to solve problems until you face them. – wildplasser May 20 '18 at 18:39
  • @wildplasser Why do you expect the second query to perform worse with bigger tables? – Laurenz Albe May 20 '18 at 19:27
  • Sorry, I never said that. I expect the optimizer to pick a completely **different** plan once some threshold is hit, and *probably* the two sorts will be avoided, assuming usable indexes. (The first query could change as well.) – wildplasser May 20 '18 at 19:33
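
(A hedged sketch of the coalesce mentioned in the comments above, applied to the correlated-subquery version; the empty-array literal is one possible default:)

select   film_id, title,
         coalesce(
           (
             select     array_agg(first_name)
             from       actor
             inner join film_actor using(actor_id)
             where      film_actor.film_id = film.film_id
           ),
           '{}') as actors
from     film
order by title;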

2 Answers


I'm a little surprised that the second one performs better. For instance, I would expect the second one to get an error, because f.title appears in the select and order by but not in the group by -- but I get your point. (PostgreSQL allows it here because f.film_id is the primary key, so f.title is functionally dependent on the grouping column.)

But the answer to your question is that although SQL is a descriptive language rather than a procedural language, the structure of the query can -- for some databases -- affect the execution plan. Your explain outputs show that this is clearly the case for these two queries.

The more important answer is that although the queries look the same, they are not semantically equal. In particular, if film.film_id is not unique, they return different answers.
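
To make that concrete, here is a minimal sketch with hypothetical throwaway tables (the real film.film_id is a primary key, so this cannot happen in the sample schema):

create temp table film_dup       (film_id int, title text);
create temp table film_actor_dup (film_id int, actor_id int);
create temp table actor_dup      (actor_id int, first_name text);

insert into film_dup       values (1, 'ALPHA'), (1, 'ALPHA');  -- film_id not unique
insert into actor_dup      values (10, 'NICK');
insert into film_actor_dup values (1, 10);

-- correlated subquery: two rows, each with actors = {NICK}
select film_id, title,
       (
         select     array_agg(first_name)
         from       actor_dup
         inner join film_actor_dup using(actor_id)
         where      film_actor_dup.film_id = film_dup.film_id
       ) as actors
from film_dup;

-- join + group by: one row, with actors = {NICK,NICK}
select f.film_id, f.title, array_agg(a.first_name) as actors
from   film_dup f
   left join film_actor_dup fa using(film_id)
   left join actor_dup      a  using(actor_id)
group by f.film_id, f.title;

(Note that group by f.film_id, f.title is required here, since film_dup has no primary key for PostgreSQL's functional-dependency rule to use.)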

Gordon Linoff

This is too much for a comment.

The rewrite of your Correlated Subquery should be like this:

select film_id, title, a.actors
from   film
left join
  (         
           select     film_actor.film_id, array_agg(first_name) as actors
           from       actor
           inner join film_actor using(actor_id)
           group by   film_actor.film_id
  ) as a
on a.film_id = film.film_id
order by title;
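
Why the derived table with an inner join, rather than simply grouping over two left joins? As noted in the comments below, two left joins can let a film_actor row with no matching actor survive into the aggregate, and array_agg keeps NULL inputs. A minimal sketch with hypothetical throwaway tables (no foreign keys, so an orphan row is possible):

create temp table a_demo  (actor_id int, first_name text);
create temp table fa_demo (film_id int, actor_id int);

insert into a_demo  values (10, 'NICK');
insert into fa_demo values (1, 10),
                           (1, 99);   -- orphan: actor 99 does not exist

-- a left join keeps the orphan row, so array_agg sees a NULL
select fa.film_id, array_agg(a.first_name) as actors
from   fa_demo fa
   left join a_demo a using(actor_id)
group by fa.film_id;                  -- {NICK,NULL} (element order not guaranteed)

-- an inner join drops the orphan first, as in the derived table above
select fa.film_id, array_agg(a.first_name) as actors
from   fa_demo fa
   inner join a_demo a using(actor_id)
group by fa.film_id;                  -- {NICK}

The same behaviour of array_agg also means that, in the two-left-join version from the question, a film with no actors ends up with {NULL} as its actors value, whereas both the correlated subquery and this rewrite yield a plain null.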

Regarding performance, Scalar Correlated Subqueries simply seem to be hard for optimizers; I wouldn't expect them to perform the same as or better than a manual rewrite.

dnoeth
  • Thank you! Could you please explain why the left join should be like this, and what is wrong when comparing to 'my' version? – Jelly Orns May 20 '18 at 17:24
  • @JellyOrns: Of course again :-) The Scalar Subquery can be replaced by a Left Join and in your case the Subquery is based on an Inner Join of two tables. Two Left Joins might return a row from `film_actor` with no matching row in `actor`. – dnoeth May 20 '18 at 17:57
  • Got it! Although I believe that would not be possible in this case, as there is a foreign key from `film_actor.actor_id` to `actor.actor_id` (which I did not mention, so you are correct to assume there is none). That said, if we add a `where` clause and/or `limit` clause to the outer/root `film` query, I assume your query will perform badly? That is, I believe this approach is only desired if we are fetching *all* films and not *some* films? – Jelly Orns May 20 '18 at 18:09
  • This is only accurate if you *know* that `film.film_id` is unique. That is a reasonable assumption -- for a human. I don't know if any database uses unique constraints for rewriting query plans. – Gordon Linoff May 20 '18 at 19:08
  • @GordonLinoff Could you please explain what you mean by that? I am not sure I am following you now... – Jelly Orns May 21 '18 at 07:53