4

I am trying to write a query (PostgreSQL) to get "Movies with highest number of awards in year 2012."

I have following tables:

CREATE TABLE Award(
    ID_AWARD bigserial CONSTRAINT Award_pk PRIMARY KEY,
    award_name VARCHAR(90),
    category VARCHAR(90),
    award_year integer,
    CONSTRAINT award_unique UNIQUE (award_name, category, award_year));

CREATE TABLE AwardWinner(
    ID_AWARD integer,
    ID_ACTOR integer,
    ID_MOVIE integer,
    CONSTRAINT AwardWinner_pk PRIMARY KEY (ID_AWARD));

And I written following query, which gives correct results, but there's is quite a lot of code duplication I think.

select * from 
(select id_movie, count(id_movie) as awards 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB
where awards = (select max(count) from 
(select id_movie, count(id_movie) 
from Award natural join awardwinner 
where award_year = 2012 group by id_movie) as SUB2);

So SUB and SUB2 are exactly the same subquery. Is there a better way to do this?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Kuba Spatny
  • 26,618
  • 9
  • 40
  • 63
  • Where does `awards` in `where awards = (` come from? I don't see where/how it's defined. Copy and paste error? Or am I just missing something? – Achim Nov 16 '13 at 17:51
  • @Achim yeah sorry, as you said copy&paste error. It was an alias for count(id_movie). – Kuba Spatny Nov 16 '13 at 18:00
  • This link will show you how to do top n queries in a variety of db engines, including yours. http://www.petefreitag.com/item/59.cfm – Dan Bracuk Nov 16 '13 at 18:02
  • @DanBracuk I don't really see how I it would help.. I still need to now the number "n" for which I would need to find the max value. – Kuba Spatny Nov 16 '13 at 18:06
  • Good point. The rank function described here might help. http://www.postgresql.org/docs/9.1/static/tutorial-window.html – Dan Bracuk Nov 16 '13 at 18:10
  • @DanBracuk Thanks, that is some very helpful information! – Kuba Spatny Nov 16 '13 at 18:15

3 Answers3

7

Well you can use common table expression to avoid code duplication:

with cte_s as (
   select id_movie, count(id_movie) as awards
   from Award natural join awardwinner 
   where award_year = 2012
   group by id_movie
)
select
    sub.id_movie, sub.awards
from cte_s as sub
where sub.awards = (select max(sub2.awards) from cte_s as sub2)

or you can do something like this with window function (untested, but I think PostgreSQL allows this):

with cte_s as (
    select
        id_movie,
        count(id_movie) as awards,
        max(count(id_movie)) over() as max_awards
    from Award natural join awardwinner 
    where award_year = 2012
    group by id_movie
)
select id_movie
from cte_s
where max_awards = awards

Another way to do this could be to use rank() function (untested, may be you have to use two cte instead of one):

with cte_s as (
    select
        id_movie,
        count(id_movie) as awards,
        rank() over(order by count(id_movie) desc) as rnk
    from Award natural join awardwinner 
    where award_year = 2012
    group by id_movie
)
select id_movie
from cte_s
where rnk = 1

update When I've created this answer, my main goal was to show how to use cte to avoid code duplication. In genearal, it's better to avoid using cte more than one time in query if it's possible - first query uses 2 table scan (or index seek) and second and third uses only one, so I've should specify that it's better to go with these queries. Anyway, @Erwin made this tests in his answer. Just to add to his great major points:

  • I also advice against natural join because of error-prone nature of this. Actually, my main RDBMS is SQL Server which are not support it so I'm more used to explicit outer/inner join.
  • It's good habit to always use aliases in your queries, so you can avoid strange results.
  • This could be totally subjective thing, but usually if I'm using some table only to filter out rows from main table of the query (like in this query, we just want to get awards for year 2012 and just filter rows from awardwinner), I prefer not to use join, but use exists or in instead, it seems more logical for me.
So final query could be:
with cte_s as (
    select
        aw.id_movie,
        count(*) as awards,
        rank() over(order by count(*) desc) as rnk
    from awardwinner as aw
    where
        exists (
            select *
            from award as a
            where a.id_award = aw.id_award and a.award_year = 2012
        )
    group by aw.id_movie
)
select id_movie
from cte_s
where rnk = 1
Community
  • 1
  • 1
Roman Pekar
  • 107,110
  • 28
  • 195
  • 197
2

Get all winning movies

SELECT id_movie, awards
FROM  (
   SELECT aw.id_movie, count(*) AS awards
         ,rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
   FROM   award       a
   JOIN   awardwinner aw USING (id_award)
   WHERE  a.award_year = 2012
   GROUP  BY aw.id_movie
   ) sub
WHERE  rnk = 1;

Major points

  • This should be simpler and faster than suggestions so far. Test with EXPLAIN ANALYZE.

  • There are cases where CTEs are instrumental to avoid code duplication. But not in this time: a subquery does the job just fine and is usually faster.

  • You can run a window function OVER an aggregate function on the same query level. That's why this works:

    rank() OVER (ORDER BY count(aw.id_movie) DESC) AS rnk
    
  • I'd suggest to use explicit column names in the JOIN condition instead of NATURAL JOIN, which is prone to breakage if you later change / add columns to the underlying tables.
    The JOIN condition with USING is almost as short, but doesn't break as easily.

  • Since id_movie cannot be NULL (ruled out by the JOIN condition and also part of the pk) it is shorter ans slightly faster to use count(*) instead. Same result.

Just one movie

Shorter and faster, yet, if you only need one winner:

SELECT aw.id_movie, count(*) AS awards
FROM   award       a
JOIN   awardwinner aw USING (id_award)
WHERE  a.award_year = 2012
GROUP  BY 1
ORDER  BY 2 DESC, 1 -- as tie breaker
LIMIT  1

Using positional references (1, 2) here as shorthand.
I added id_movie to ORDER BY as tie breaker in case multiple movies should qualify for the win.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thank you for such thorough explanation, especially comparing the time consumptions. I ended up using rank function as many here suggested, still your query is the neatest of them all. – Kuba Spatny Nov 20 '13 at 08:01
0

Don't you need something like this?

SELECT ID_MOVIE, COUNT(*)
FROM AwardWinner
JOIN Award ON Award.ID_AWARD = AwardWinner.ID_AWARD
WHERE award_year = 2012
GROUP BY ID_MOVIE
ORDER BY COUNT(*) DESC

Or possibly (depends on what you're looking for):

SELECT ID_MOVIE, COUNT(DISTINCT AwardWinner.ID_AWARD)
FROM AwardWinner
JOIN Award ON Award.ID_AWARD = AwardWinner.ID_AWARD
WHERE award_year = 2012
GROUP BY ID_MOVIE
ORDER BY COUNT(*) DESC
Wolph
  • 78,177
  • 11
  • 137
  • 148
  • Actually I just need one movie with the highest count, or several movies if they have the same number of awards (the maximum). – Kuba Spatny Nov 16 '13 at 18:02