5

I am trying to make the query to get each year with a count of the number of movies in that year that had casts that all were not male (for each year, count the number of movies in that year that had no males).

These are the tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
CASTS (pid, mid, role)  -- pid refers to actor id, mid refers to movie id

This is what I indexed (id for the tables are primary keys, so they're already indexed, or so I assume):

CREATE INDEX gender_index on actor(gender);
CREATE INDEX movie_name_index on movie(name);
CREATE INDEX movie_year_index on movie(year);
CREATE INDEX casts_index on casts(pid, mid, role);
CREATE INDEX casts_pid_index on casts(pid);
CREATE INDEX casts_mid_index on casts(mid);
CREATE INDEX casts_role_index on casts(role);

This is my query:

SELECT m.year, count(m.id)
FROM movie as m
WHERE m.id NOT IN (
    SELECT DISTINCT m.id
    FROM movie as m, casts as c, actor as a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year

The query takes forever (and never completes), so how can I make this fast? Does using NOT EXISTS help, although I thought the optimizer takes care of this? Do I need to index anything else? Is there another query that is better? I'm using PostgreSQL if this makes any difference.

Here is the EXPLAIN:

"GroupAggregate  (cost=1512539.61..171886457832.52 rows=61 width=8)"
"  Group Key: m.year"
"  ->  Index Scan using movie_year_index on movie m  (cost=1512539.61..171886453988.38 rows=768706 width=8)"
"        Filter: (NOT (SubPlan 1))"
"        SubPlan 1"
"          ->  Materialize  (cost=1512539.18..1732298.66 rows=1537411 width=4)"
"                ->  Unique  (cost=1512539.18..1718605.60 rows=1537411 width=4)"
"                      ->  Merge Join  (cost=1512539.18..1700559.32 rows=7218511 width=4)"
"                            Merge Cond: (m_1.id = c.mid)"
"                            ->  Index Only Scan using movie_pkey on movie m_1  (cost=0.43..57863.94 rows=1537411 width=4)"
"                            ->  Materialize  (cost=1512531.37..1548623.92 rows=7218511 width=4)"
"                                  ->  Sort  (cost=1512531.37..1530577.65 rows=7218511 width=4)"
"                                        Sort Key: c.mid"
"                                        ->  Hash Join  (cost=54546.59..492838.95 rows=7218511 width=4)"
"                                              Hash Cond: (c.pid = a.id)"
"                                              ->  Seq Scan on casts c  (cost=0.00..186246.43 rows=11445843 width=8)"
"                                              ->  Hash  (cost=35248.91..35248.91 rows=1176214 width=4)"
"                                                    ->  Seq Scan on actor a  (cost=0.00..35248.91 rows=1176214 width=4)"
"                                                          Filter: ((gender)::text = 'M'::text)"
Jack
  • 5,264
  • 7
  • 34
  • 43

4 Answers4

4

I would try

SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (
    SELECT NULL
    FROM casts c, actor a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year
Luc M
  • 16,630
  • 26
  • 74
  • 89
  • 2
    Why would `NOT EXISTS` work better than `NOT IN`? Shouldn't the query optimizer optimize both and make them have no difference? – Jack Mar 03 '17 at 07:30
  • [See this question/answer](http://stackoverflow.com/questions/24929/difference-between-exists-and-in-in-sql) – Luc M Mar 03 '17 at 13:16
  • @Jack: check the execution plan and you will know –  Mar 03 '17 at 22:38
  • 1
    I found an answer regarding PostgreSQL: https://www.postgresql.org/message-id/web-2291801@davinci.ethosmedia.com – Jack Mar 05 '17 at 22:44
3

I would first drop the distinct. I've run into many instances where a distinct in a subquery creates awful query plans. Also, I would maybe try using a left outer join with a filter for null instead of a not exists. Something like

SELECT m_fems.year, count(m_fems.id)
FROM movie as m_fems
    left outer join movie m_males
        inner join casts c on c.mid = m_males.mid
        inner join actor a on a.gender = 'M' and a.cid = c.cid
    on m_males.mid = m_fems.mid
WHERE m_males.mid is null
GROUP BY m_fems.year
ORDER BY m_fems.year

The potentially many-to-one left outer join won't interfere with your count because any matches are filtered out by m_males.mid is null.

Also, apologies if this isn't exactly proper PostgreSQL. I usually deal with SQL Server / T-SQL

ashja99
  • 378
  • 5
  • 13
2

First, use proper explicit JOIN syntax. Second, use a correlated subquery instead of NOT IN:

SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (SELECT 
                  FROM casts c JOIN
                       actor a
                       ON a.id = c.pid
                  WHERE m.id = c.mid AND a.gender = 'M' 
                 )
GROUP BY m.year
ORDER BY m.year;

However, my inclination would be to use conditional aggregation:

SELECT m.year, SUM(CASE WHEN num_m = 0 THEN 1 ELSE 0 END) as cnt
FROM (SELECT m.id, m.year,
             SUM(CASE WHEN a.gender = 'M' THEN 1 ELSE 0 END) as num_m
      FROM movie m JOIN
           casts c 
           ON m.id = c.mid JOIN
           actor a
           ON a.id = c.pid
      GROUP BY m.id, m.year
     ) m
GROUP BY m.year
ORDER BY m.year;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • This puzzles me why using a correlated subquery would work better than the `NOT IN`. Isn't it more inefficient to evaluate the inner query for each row processed by the outer query? And, shouldn't the query optimizer optimize the `NOT IN` so that it's the same query plan as `NOT EXISTS`? – Jack Mar 03 '17 at 07:36
  • @Jack . . . Your version of the query pretty much precludes the optimizer from using indexes for the `NOT IN`. But more importantly, `NOT IN` generally does not do what you want if any of the returned values are `NULL`, so my habit is to use `NOT EXISTS` with a subquery. – Gordon Linoff Mar 03 '17 at 13:24
0

The problem is in using NOT IN with distinct in the subquery. For some reason this messes up the execution plan. The solution is to store the distinct values in a temp table and use this table in the subquery:

SELECT DISTINCT m.id
INTO temp_Table
FROM movie as m, casts as c, actor as a
WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'

SELECT m.year, count(m.id)
FROM movie as m
WHERE m.id NOT IN ( SELECT m.id FROM temp_Table )
GROUP BY m.year
ORDER BY m.year
LoMaPh
  • 1,476
  • 2
  • 20
  • 33