3

I have tables called Movie, Genre and Keyword, from which I have created a view called 'genkeyword'. The view 'genkeyword' has a lot of tuples, so it can be accessed at DB Fiddle.

I have the following query:

SELECT title, 
       year, 
       Count(DISTINCT genre)   AS genre_freq, 
       Count(DISTINCT keyword) AS keyword_freq 
FROM   genkeyword 
WHERE  ( genre IN (SELECT genre 
                   FROM   genkeyword 
                   WHERE  title = 'Harry Potter and the  Deathly Hallows') 
          OR keyword IN (SELECT keyword 
                         FROM   genkeyword 
                         WHERE  title = 'Harry Potter and the  Deathly Hallows') ) 
       AND title <> 'Harry Potter and the Deathly Hallows' 
GROUP  BY title, 
          year 
ORDER  BY genre_freq DESC, 
          keyword_freq DESC; 

What I am intending to do with this query is to get the genre and keyword frequency for each movie that has genres and keywords that are in common with Harry Potter: The output should be:

title                      |      genre_freq    |    keyword_freq
Cinderella                        2                        2
The Shape of Water                2                        1
How to Train Your Dragon          2                        0
Enchanted                         1                        3

I know that the query is not correct, since the I get the following output instead:

    title                      |      genre_freq    |    keyword_freq
    The Shape of Water                4                  3       
    Enchanted                         3                  4
    Cinderella                        2                  5
    How to Train Your Dragon          2                  3              

However, I would like to clarify my understanding about how the query works.

In the 'where' clause of my query:

where (genre in (select genre from genkeyword where title='Harry Potter') or 
keyword in (select keyword from genkeyword where title='Harry Potter')) 

Am I right in saying that there are two result sets generated, one containing all the tuples which have a genre that is in Harry Potter (let this be R1) and the other containing all the tuples that have a keyword that is in Harry Potter (let this be R2)?

If the tuple under consideration contains a genre that is in the genre result set R1 or a keyword that is in the keyword result set R2, then the genre/keyword is counted. I am not sure how count(distinct genre) and count(distinct keyword) works in this case. If the tuple contains a genre that is in R1, is only the genre counted or is the keyword counted as well? This is the same for the case when the tuple contains a keyword in R2, is the genre counted as well as the keyword?

I don't understand why I am getting the genre_freq and keyword_freq values wrong from my query. This is because I don't fully understand how the genre and keyword frequencies are getting counted behind-the-scenes. Any insights are appreciated.

ceno980
  • 2,003
  • 1
  • 19
  • 37

4 Answers4

0

One of the best-asked questions I have seen so far on SO.

To answer your question. The OR clause basically pastes the result of both the keyword part and the genre part below each other. SQL works in rows (or records), so you should always think in rows.

First, it selects all the rows containing the same genre like Harry Potter. Then it selects all the rows containing the keywords. Then it performs the count. Obviously, this is too high, because you will also get all the records that do not have the same genre, but do have overlapping keywords. You will also get all rows that do have overlapping genres, but not overlapping keywords.

To properly count the records simply change OR to AND. This will only select the records that have the same genre as well as contain keywords. Counting these will produce the correct result.

Imre_G
  • 2,468
  • 1
  • 17
  • 33
  • 'OR to AND' trouble is that if the movie has a genre but no keyword then it doesn't appear at all (and the inverse is true) – P.Salmon Apr 19 '19 at 09:17
  • Except it won't, because one row has 0 hits on genre (or keyword I can't remember), so this row will then be excluded. Which incidentally makes it an even better question? ;0 – Richard Hansell Apr 19 '19 at 09:17
  • I agree that this question deserved a better thought out answer :) – Imre_G Apr 19 '19 at 09:35
0

As Imre_G said, this is a good question, and his explanation of what is going wrong is spot on. You are basically picking genres and keywords you don't want, then counting these up because they share a common element.

This is one way to fix it, maybe not the best, but the simplest:

SELECT
    COALESCE(a.title, b.title) AS title,
    COALESCE(a.year, b.year) AS year,
    a.genre_freq,
    b.keyword_freq
FROM
(SELECT title, year, count(distinct genre) as genre_freq FROM genkeyword where (genre in 
(select genre from genkeyword where title='Harry Potter and the Deathly Hallows') )
AND title <> 'Harry Potter and the Deathly Hallows'
group by title, year) a
LEFT JOIN
(select title, year, 
count(distinct keyword) as keyword_freq 
from genkeyword
where keyword in (select keyword from genkeyword where title='Harry Potter and the Deathly Hallows')
 and title <> 'Harry Potter and the Deathly Hallows' group by title, year) b
 ON b.title = a.title;

Now that solution only works if there's a keyword match for a movie. The proper solution would be to replace the LEFT JOIN with a FULL OUTER JOIN, but MySQL doesn't support FULL OUTER JOINs for some reason. There is a solution for this as well, but it's long, and involves lots of UNIONs ;(

How to do a FULL OUTER JOIN in MySQL?

Richard Hansell
  • 5,315
  • 1
  • 16
  • 35
0

You could invert your logic and drive from genre and keywords using a sub query before totaling

select title,year,
        sum(case when src = 'g' then 1 else 0 end) as genre,
        sum(case when src = 'k' then 1 else 0 end) as keyword
from
(
select 'g' as src, g1.title ,g1.year, g1.genre
from genkeyword g
join genkeyword g1 on g1.genre = g.genre
where g.title =  'Harry Potter and the Deathly Hallows' and g1.title <> 'Harry Potter and the Deathly Hallows'
union
select 'k' as src, g1.title ,g1.year, g1.genre
from genkeyword g
join genkeyword g1 on g1.keyword = g.keyword
where g.title =  'Harry Potter and the Deathly Hallows' and g1.title <> 'Harry Potter and the Deathly Hallows'
) s
group by title , year;

+--------------------------+------+-------+---------+
| title                    | year | genre | keyword |
+--------------------------+------+-------+---------+
| Cinderella               | 2015 |     2 |       2 |
| Enchanted                | 2007 |     1 |       3 |
| How to Train Your Dragon | 2010 |     2 |       0 |
| The Shape of Water       | 2017 |     2 |       4 |
+--------------------------+------+-------+---------+
4 rows in set (0.10 sec)
P.Salmon
  • 17,104
  • 2
  • 12
  • 19
  • The select 'g' as src in the nested subquery creates a column with an attribute 'src' where are all the values for 'src' are g. I haven't seen something like this in an SQL query before. Usually in a SELECT statement, you are selecting columns that already exist. Is there a specific term for this feature that enables you to create a new column and fill it with a value of your choice? – ceno980 Apr 20 '19 at 10:08
  • Usually referred to as a dummy column in the context of a union. – P.Salmon Apr 20 '19 at 10:21
0

Try this query.
I haven't used any of the views you created, but you can use those if you want.

MySQL

SET @tmpMovieid = (SELECT DISTINCT id 
                   FROM Movie 
                   WHERE title = 'Harry Potter and the Deathly Hallows');

SELECT id,
       title,
       IFNULL(Max(CASE WHEN coltype = 'genre' THEN col end),   0) AS genre_freq,
       IFNULL(Max(CASE WHEN coltype = 'Keyword' THEN col end), 0) AS keyword_freq

FROM   (SELECT id,
               title,
               Count(g.genre) AS col,
               'genre'        AS colType
        FROM   Movie m
               INNER JOIN Genre g ON m.id = g.Movie_id
        WHERE  g.genre IN (SELECT DISTINCT genre
                           FROM   Genre
                           WHERE  Movie_id = @tmpMovieid)
        GROUP  BY id, title

        UNION ALL

        SELECT id,
               title,
               Count(k.keyword) AS col,
               'Keyword'        AS colType
        FROM   Movie m
               INNER JOIN Keyword k ON m.id = k.Movie_id
        WHERE  k.keyword IN (SELECT DISTINCT keyword
                             FROM   Keyword
                             WHERE  Movie_id = @tmpMovieid)
        GROUP  BY id, title) tmp

WHERE  id <> @tmpMovieid
GROUP  BY id, title
ORDER  BY genre_freq DESC, keyword_freq DESC;

Online Demo: https://www.db-fiddle.com/f/s1xLQ6r4Zwi5hVjCsdcwV8/0


SQL Server
Note: Since you have used 'text' as some of the column data types, it needed to convert for some operations. But then again, since you're using MySQL, you don't need this. I wrote this anyway to show you the difference and for fun.

DECLARE @tmpMovieID INT;
SET @tmpMovieID = (SELECT DISTINCT id
                   FROM   movie
                   WHERE  Cast(title AS NVARCHAR(MAX)) = 'Harry Potter and the Deathly Hallows');

SELECT tmpGenre.id                  AS id,
       tmpGenre.title               AS title,
       ISNULL(tmpGenre.genre, 0)    AS genre,
       ISNULL(tmpKeyword.keyword,0) AS keyword

FROM   (SELECT id,
               Cast(title AS NVARCHAR(MAX))          AS title,
               Count(Cast(g.genre AS NVARCHAR(MAX))) AS genre
        FROM   movie m
               INNER JOIN genre g ON m.id = g.movie_id
        WHERE  Cast(g.genre AS NVARCHAR(MAX)) IN (SELECT DISTINCT Cast(genre AS NVARCHAR(MAX))
                                                 FROM   genre
                                                 WHERE  movie_id = @tmpMovieID)
        GROUP  BY id, Cast(title AS NVARCHAR(MAX))) tmpGenre

       FULL OUTER JOIN (SELECT id,
                               Cast(title AS NVARCHAR(MAX))            AS title,
                               Count(Cast(k.keyword AS NVARCHAR(MAX))) AS Keyword
                        FROM   movie m
                               INNER JOIN keyword k ON m.id = k.movie_id
                        WHERE  Cast(k.keyword AS NVARCHAR(MAX)) IN
                               (SELECT DISTINCT Cast(keyword AS NVARCHAR(MAX))
                                FROM   keyword
                                WHERE  movie_id = @tmpMovieID)
                        GROUP  BY id, Cast(title AS NVARCHAR(MAX))) tmpKeyword

                    ON tmpGenre.id = tmpKeyword.id
WHERE  tmpGenre.id <> @tmpMovieID
ORDER  BY tmpGenre.genre DESC, tmpKeyword.keyword DESC;

Online Demo: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=a1ee14e1e08b7e55eff2e8e94f89a287&hide=1


Result

+------+---------------------------+-------------+--------------+
| id   |          title            | genre_freq  | keyword_freq |
+------+---------------------------+-------------+--------------+
| 407  | Cinderella                |          2  |            2 |
| 826  | The Shape of Water        |          2  |            1 |
| 523  | How to Train Your Dragon  |          2  |            0 |
| 799  | Enchanted                 |          1  |            3 |
+------+---------------------------+-------------+--------------+

By the way, thank you for asking a clear question and giving out table schema, sample data, and desired output.

DxTx
  • 3,049
  • 3
  • 23
  • 34