PostgreSQL NOT IN does not work correctly with JOIN

Question

I have trivial tables post, tag and post_tags in a trivial Many-To-Many relationship. I want to select some posts by including and excluding some tags. I tried many variations of SQL queries, but none of them works for excluding tags.
I started from a query like this:

SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id 
INNER JOIN tag ON post_tags.tag_id = tag.id
WHERE tag.name IN ('Science','Culture')
    AND tag.name NOT IN ('War', 'Crime')
GROUP BY post.id 
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;

But, unfortunately, this does not work. I see posts with tag "War" in result set. Then I tried to move the NOT IN condition to a separate subquery on post_tags and join to it:

SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id 
INNER JOIN (SELECT * FROM tag WHERE name NOT IN ('War', 'Crime')) AS tags 
    ON post_tags.tag_id = tags.id
WHERE tags.name IN ('Science','Culture')        
GROUP BY post.id 
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;

Even tried to exclude some posts in first JOIN like this:

SELECT post.* FROM post
INNER JOIN post_tags ON post.id = post_tags.post_id 
    AND post_tags.tag_id NOT IN (SELECT id FROM tag WHERE name IN ('War', 'Crime'))
INNER JOIN tag ON post_tags.tag_id = tag.id
WHERE tag.name IN ('Science','Culture')        
GROUP BY post.id 
HAVING COUNT(post_tags.id) > 1
ORDER BY post.rating DESC
LIMIT 50;

But none of this works. I am especially confused about second query (joining with filtered result set instead of table).
Using PostgreSQL version 9.3, OS Ubuntu 14.04.
Any thoughts?

score 3 · Accepted Answer · answered Nov 07 '14 at 23:59

It is working fine. It is your logic that is off. You are filtering out the very tags that you want to check for. So, they are not part of the check.

Instead, move the conditions to the having clause:

SELECT p.*
FROM post p INNER JOIN
     post_tags pt
     ON p.id = pt.post_id INNER JOIN
     tag t
     ON pt.tag_id = t.id
WHERE t.name IN ('Science', 'Culture', 'War', 'Crime')
GROUP BY p.id 
HAVING SUM(CASE WHEN t.name IN ('Science', 'Culture') THEN 1 ELSE 0 END) > 1 AND
       SUM(CASE WHEN t.name IN ('War', 'Crime') THEN 1 ELSE 0 END) = 0
ORDER BY p.rating DESC;

There is a difference between ignoring a value (in the where clause) versus checking that it is not there (in the having clause).

score 2 · Answer 2 · edited May 23 '17 at 12:20

This is an application of relational-division. Check out the tag description.

You have to define what you want exactly. Posts with one of the "good" tags and none of the "bad" tags? Or all of the good tags?

The best query technique depends on the table layout. Typically we'd assume referential integrity and that (post_id, tag_id) is defined unique in post_tags, but that's not defined.

Assuming that, and describing your problem as:

Return the 50 posts with the highest rating with at least one of the tags ('Science','Culture') and none of the tags ('War', 'Crime').

We can translate this plain English sentence into SQL directly:

SELECT p.*
FROM   post p
WHERE  EXISTS (              -- at least one of the tags ('Science','Culture')
   SELECT 1
   FROM   tag t
   JOIN   post_tags pt ON pt.tag_id = t.id
   WHERE  pt.post_id = p.id 
   AND    t.name IN ('Science', 'Culture')
AND    NOT EXISTS (          -- none of the tags ('War', 'Crime')
   SELECT 1
   FROM   tag t
   JOIN   post_tags pt ON pt.tag_id = t.id
   WHERE  pt.post_id = p.id 
   AND    t.name IN ('War', 'Crime')
ORDER  BY p.rating DESC      -- with the highest rating
LIMIT  50;                   -- 50 posts

This is typically faster than grouping rows and counting - and also works if (post_id, tag_id) is not unique.

More techniques for relational division:

How to filter SQL results in a has-many-through relation

I must say that your solution is 1.5 times faster on my database than the first solution. Thanks! — Void Floyd, Nov 25 '14 at 23:19
@VoidFloyd: So what made you accept the less efficient solution again? — Erwin Brandstetter, Nov 30 '14 at 04:26
On the larger amounts of data this solution accidentally becomes more and more slow. The first solution is faster. But the fastest solution is to join post_tags table twice - the first time that's inner join with IN condition, and the second time that's outer join with NOT IN condition. Then we can count pt.id and pt2.id and make HAVING condition on that count - 2 and 0, respectively. Also, of course that's better to not join tag tabe itself, better to get needed ids from it (or from request, depends on application) and use them in joiningconditions of post_tags. — Void Floyd, Dec 01 '14 at 01:43
@VoidFloyd: Yes, two joins, that was the winner of the benchmark in the linked answer as well. — Erwin Brandstetter, Dec 01 '14 at 03:22

PostgreSQL NOT IN does not work correctly with JOIN

2 Answers2