How to express this SQL query without subqueries

Question

I have a database that contains project, keyphrase, article, and category tables. Projects can have multiple keyphrases and multiple categories and so can articles. So I have respective join tables for each relationship.

I have a query that finds all of the articles that share both a keyphrase and a category with a specific project (i.e. 26).

SELECT DISTINCT article.id, article.summary
FROM article, articles_categories, articles_keyphrases 
WHERE id = articles_categories.article_id AND id = articles_keyphrases.article_id 
AND category_id IN (SELECT category_id FROM projects_categories WHERE project_id = 26)
AND keyphrase_id IN (SELECT keyphrase_id FROM projects_keyphrases WHERE project_id = 26)

I'm trying to find a different way to write this so that there aren't any sub queries. This is mostly just a learning exercise on my part because I would really like to see different approaches to this query.

@Andronicus I know. This is a learning exercise. I'm not looking to necessarily improve performance. — Travis M, Apr 26 '20 at 04:22
Travis what you're trying to do is an overkill to the database (it's not a proper use of joins) — Andronicus, Apr 26 '20 at 04:41
JOIN vs IN (vs EXISTS) is an easily found (& clearly expected) faq. Before considering posting please read the manual & google any error message or many clear, concise & precise phrasings of your question/problem/goal, with & without your particular strings/names & site:stackoverflow.com & tags; read many answers. If you post a question, use one phrasing as title. Reflect your research. See [ask] & the voting arrow mouseover texts. Learn how to use google. — philipxy, Apr 26 '20 at 06:48

score 3 · Accepted Answer · answered Apr 26 '20 at 04:36

3

You can JOIN all four tables to get the same result. Note that comma joins have been superseded for decades, you should write ANSI joins instead:

SELECT DISTINCT id, summary
FROM article a
JOIN articles_categories ac ON ac.article_id = a.id
JOIN articles_keyphrases ak ON ak.article_id = a.id
JOIN projects_categories pc ON pc.category_id = ac.category_id AND pc.project_id = 26
JOIN projects_keyphrases pk ON pk.keyphrase_id = ak.keyphrase_id AND pk.project_id = 26

answered Apr 26 '20 at 04:36

Nick

138,499
22
57
95

Nice solution, I didn't notice the two tables in `from` are different from those in inner selects (+1). Although with 2 resulting columns and no transient relationships all those joins are just a waste of resources and should be converted into `exists` – Andronicus Apr 26 '20 at 04:40
1

@Andronicus I agree on the exists - post it as an answer and I will happy repay the favour. – Nick Apr 26 '20 at 04:41

score 2 · Answer 2 · answered Apr 26 '20 at 16:03

The DISTINCT is not needed (and always a red flag, IMHO).

Since only fields from the a table are selected, you can avoid generating duplicates (and the need to suppres them later) by squeezing all unneeded table references into the exists() - subqueries:

SELECT id, summary
FROM article a
WHERE EXISTS (
        SELECT * FROM articles_categories ac 
        JOIN projects_categories pc
            ON pc.category_id = ac.category_id AND pc.project_id = 26
        WHERE ac.article_id = a.id
        )
AND EXISTS      (
        SELECT * FROM articles_keyphrases ak 
        JOIN projects_keyphrases pk
            ON pk.keyphrase_id = ak.keyphrase_id AND pk.project_id = 26
        WHERE ak.article_id = a.id
        )
        ;

I really like this solution as it decouples the existence predicate entirely from the main query! — Andronicus, Apr 26 '20 at 16:31

Andronicus · Answer 3 · 2020-04-26T04:50:34.120

1

Since @Nick already posted a great answer, I'll share something that does not answer your question, but is most performant:

select distinct a.id, a.summary
from article a
join articles_categories ac on ac.article_id = a.id
join articles_keyphrases ak on ak.article_id = a.id
and exists (select * from projects_categories pc where pc.category_id = ac.category_id and pc.project_id = 26)
and exists (select * from projects_keyphrases pk where pk.keyphrase_id = ak.keyphrase_id and pk.project_id = 26)

edited Apr 26 '20 at 04:50

answered Apr 26 '20 at 04:25

Andronicus

25,419
17
47
88

neither `projects_categories` nor `projects_keyphrases` contain `article_id`s. – Travis M Apr 26 '20 at 04:32
@TravisM ah ok, I misread the query and thought it's a cross join with what is in subqueries, corrected – Andronicus Apr 26 '20 at 04:36
and the reason this is more performant is because it is just checking against the conditionals, as opposed to joining everything? – Travis M Apr 26 '20 at 04:51
@TravisM whether this is more performant than 4 joins will depend very much on your data and table structures. Google `sql join vs exists performance` yields a *lot* of interesting reading. – Nick Apr 26 '20 at 04:54
@TravisM postgres checks for existence in the subquery, so when only the match is found, the predicate is met and it moves on, it doesn't have to join the whole table just to check whether the row exists, check out this article: https://blog.jooq.org/2016/03/09/sql-join-or-exists-chances-are-youre-doing-it-wrong/ – Andronicus Apr 26 '20 at 05:10

How to express this SQL query without subqueries

3 Answers3