selecting based on all conditions being met (relational division)

Question

I have the tables question, topic and question_has_topic (many-to-many relationship). In my application admins see a breakdown for questions grouped by their topics and they select how many from each they'd like the system to randomly select to create a test.
This is the kind of table they see:

+-----------------------+---------------------+------------+
|        Topics         | Questions available | Selection: | 
+-----------------------+---------------------+------------+
| health,safety,general |                  13 |            |
| health                |                   3 |            |
| safety                |                   7 |            |
| general               |                   1 |            |
+-----------------------+---------------------+------------+

The count is unique for the particular grouping of topics. Anyway, once they make the selection I need a SQL statement which will select questions that correspond to the given grouping of topics. I.e. I might need 3 questions which have the topics health,safety and general.
I was doing some research online and I think that what I'm trying to do is known as divide in relational algebra and here is my attempt for an arbitrary grouping of topicids:

select questionid from question_has_topic
where not exists (
    select questionid from question_has_topic
    where topicid not in (8,9,10))

The result is empty, although there are 2 questions in the database that have all these topic ids which tells me this isn't working. I was following the example from this link

See also this question with more than 10 ways to find the result: [How to filter SQL results in a has-many-through relation](http://stackoverflow.com/questions/7364969/how-to-filter-sql-results-in-a-has-many-through-relation), with performance testing (for Postgres, not MySQL). — ypercubeᵀᴹ, Feb 20 '12 at 07:18

user1191247 · Answer 1 · 2012-02-20T09:02:07.780

2

I think this is what you were trying to write but it is a very inefficient way of doing it -

SELECT questionid FROM question WHERE NOT EXISTS (
    SELECT topicid FROM topic WHERE topicid NOT IN (
        SELECT topicid FROM question_has_topic WHERE question.questionid = question_has_topic.questionid
    ) AND topicid IN (8, 9, 10)
);

This is definitely much faster -

SELECT *
FROM question_has_topic t1
INNER JOIN question_has_topic t2
    ON t1.questionid = t2.questionid AND t2.topicid = 9
INNER JOIN question_has_topic t3
    ON t2.questionid = t3.questionid AND t3.topicid = 10
WHERE t1.topicid = 8;

UPDATE: I knew there was a simpler answer. Cheran's method is much simpler and should run slightly faster than the INNER JOINs. Please accept his answer.

edited Feb 20 '12 at 09:02

answered Feb 20 '12 at 02:53

user1191247

10,808
2
22
32

Thanks that's what I was looking for. I was doing it the way you suggested in the second solution previously -- I didn't realise this is the more efficient solution though. – artful dodger Feb 20 '12 at 06:54
The form of relational division that you were trying to perform is only really suitable when checking for the existence of an entire set such as in the example you linked to. So, relating it to your schema, it would be useful to find questions that have EVERY topic. – user1191247 Feb 20 '12 at 09:07
This method, with `JOIN` is usually faster than the `GROUP BY`. But it depends on the data distribution, how many questions you are quering for (3 or 20?), etc. – ypercubeᵀᴹ Feb 20 '12 at 22:38

Cheran Shunmugavel · Accepted Answer · 2013-07-02T05:31:24.013

2

EDIT: removed my old post since I misread the question.

This is a technique I've used in the past:

  SELECT qht.questionid
    FROM question_has_topic AS qht
   WHERE qht.topicid IN (8,9,10)
GROUP BY qht.questionid
  HAVING COUNT(*) = 3 AND
         COUNT(*) = (SELECT COUNT(*) FROM question_has_topic AS dupe
                     WHERE dupe.questionid = qht.questionid)

where 3 corresponds to the number of topics in the given group. This assumes that each (questionid, topicid) pair in question_has_topic is unique (which it should be in a many-to-many relationship table).

The way this query works is by first selecting any question that has at least one of the desired topics assigned to it (WHERE qht.topicid IN (8,9,10)), then grouping by the questionid. The first HAVING clause (COUNT(*) = 3) can only be true if a given question has all three topics assigned to it (since we assume duplicates aren't allowed in this table). The second HAVING clause checks the total number of topics that are assigned to the question. This is to guard against the case where, for example, a question may have topics 8, 9, 10, and 11 assigned to it.

edited Jul 02 '13 at 05:31

answered Feb 20 '12 at 06:07

Cheran Shunmugavel

8,319
1
33
40

I've already got this done and wasn't what I was looking for. My bad though, I didn't explain the problem very clearly. – artful dodger Feb 20 '12 at 06:52
@artfuldodger: whoops, I should have read more carefully. Check my edit. – Cheran Shunmugavel Feb 20 '12 at 07:06
@artfuldodger: I woke up this morning and realized that there needs to be another condition to prevent, for instance, matching a `questionid` that has topics (8,9,10,11). – Cheran Shunmugavel Feb 20 '12 at 15:26
I appreciate it. Thanks. I'll use this method but lets say I select one set for the user's selection i.e 3 questions from topics 8,9,10, but after that I need say 10 questions just from topic 8 -- to avoid re-selecting the same question I keep track of used ids and use a where not in (usedids) clause to prevent it from happening. Is there another way? – artful dodger Feb 20 '12 at 15:38
You could achieve that effect by adding `and questionid not in (usedids)` to the `having` clause. Since `having` is evaluated after the `group by` clause, it wouldn't interfere with the intent of this technique. – Cheran Shunmugavel Feb 21 '12 at 06:58
@CheranShunmugavel Can you explain in a bit more detail please what the "having count(*) = 3" has on the query? I noticed if I leave this off then the 'where in' clause acts just like an "OR" clause for each of the topicids. – user658182 Jul 01 '13 at 13:26
@user658182, I added some explanation to the answer. the `HAVING COUNT(*) = 3` clause ensures that the selected question contains all three of the desired topics. – Cheran Shunmugavel Jul 02 '13 at 05:32
For me this gave empty result, but when I changed the two COUNT s to COUNT(*) > 3 AND COUNT(*) <= (SELECT COUNT(*) (..etc) , then it worked. So I wrote > 3 if the number of topics were 4. – Galivan Oct 31 '18 at 16:24

selecting based on all conditions being met (relational division)

2 Answers2